MP(3C)

NAME
mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum,
mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop,
mp_is_master, mp_set_slave_stacksize - C multiprocessing utility functions
SYNOPSIS
void mp_block()
void mp_unblock()
void mp_blocktime(iters)
int iters
void mp_setup()
void mp_create(num)
int num
void mp_destroy()
int mp_numthreads()
void mp_set_numthreads(num)
int num
int mp_my_threadnum()
int mp_is_master()
void mp_setlock()
void mp_unsetlock()
void mp_barrier()
int mp_in_doacross_loop()
void mp_set_slave_stacksize(size)
int size
unsigned int mp_suggested_numthreads(num)
unsigned int num
DESCRIPTION
These routines give some measure of control over the parallelism used in
C programs. They should not be needed by most users, but will help to
tune specific applications.
mp_block puts all slave threads to sleep via blockproc(2). This frees
the processors for use by other jobs. This is useful if it is known that
the slaves will not be needed for some time, and the machine is being
shared by several users. Calls to mp_block may not be nested; a warning
is issued if an attempt to do so is made.
mp_unblock wakes up the slave threads that were previously blocked via
mp_block. It is an error to unblock threads that are not currently
blocked; a warning is issued if an attempt is made to do so.
It is not necessary to explicitly call mp_unblock. When a parallel
region is entered, a check is made, and if the slaves are currently
blocked, a call is made to mp_unblock automatically.
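For example, a program that alternates long serial phases with parallel
phases might release the processors during a serial phase. The sketch
below assumes compilation with -mp; long_serial_phase is a hypothetical
routine standing in for the serial work:

     /* ... a parallel phase has just finished ... */
     mp_block();              /* put the slave threads to sleep       */
     long_serial_phase();     /* processors are free for other jobs   */
     mp_unblock();            /* optional: entering the next parallel */
                              /* region would wake the slaves         */
                              /* automatically                        */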
mp_blocktime controls the amount of time a slave thread waits for work
before giving up. When enough time has elapsed, the slave thread blocks
itself. This automatic blocking is independent of the user level
blocking provided by the mp_block/mp_unblock calls. Slave threads that
have blocked themselves will be automatically unblocked upon entering a
parallel region. The argument to mp_blocktime is the number of times to
spin in the wait loop. By default, it is set to 10,000,000. This takes
about 0.25 seconds on a 200 MHz processor. As a special case, an argument
of 0 disables the automatic blocking, and the slaves will spin wait
without limit. The environment variable MP_BLOCKTIME may be set to an
integer value. It acts like an implicit call to mp_blocktime during
program startup.
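As an illustration, either of the following calls (the values are
arbitrary) shortens the spin or disables automatic blocking:

     mp_blocktime(100000);    /* block after 100,000 spins            */
     mp_blocktime(0);         /* never block; spin wait without limit */

The first case could equally be requested with
setenv MP_BLOCKTIME 100000 before running the program.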
mp_destroy deletes the slave threads. They are stopped by forcing them
to call exit(2). In general, doing this is discouraged. mp_block can be
used in most cases.
mp_create creates and initializes threads. It creates enough threads so
that the total number is equal to the argument. Since the calling thread
already counts as one, mp_create will create one less than its argument
in new slave threads.
mp_setup also creates and initializes threads. It takes no arguments.
It simply calls mp_create using the current default number of threads.
Normally the default number is equal to the number of cpus currently on
the machine. If the user has not called either of the thread creation
routines already, then mp_setup is invoked automatically when the first
parallel region is entered. If the environment variable MP_SETUP is set,
then mp_setup is called during initialization, before any user code is
executed.
mp_numthreads returns the number of threads that would participate in an
immediately following parallel region. If the threads have already been
created, then it returns the current number of threads. If the threads
have not been created, then it returns the current default number of
threads. The count includes the master thread. Knowing this count can be
useful in optimizing certain kinds of parallel loops by hand, but this
function has the side-effect of freezing the number of threads to the
returned value. As a result, this routine should be used sparingly. To
determine the number of threads without this side-effect, see the
description of mp_suggested_numthreads below.
mp_set_numthreads sets the current default number of threads to the
specified value. Note that this call does not directly create the
threads, it only specifies the number that a subsequent mp_setup call
should use. If the environment variable MP_SET_NUMTHREADS is set, it
acts like an implicit call to mp_set_numthreads during program startup.
For convenience when operating among several machines with different
numbers of cpus, MP_SET_NUMTHREADS may be set to an expression involving
integer literals, the binary operators + and -, the binary functions min
and max, and the special symbolic value ALL which stands for "the total
number of available cpus on the current machine." Thus, something simple
like
setenv MP_SET_NUMTHREADS 7
would set the number of threads to seven. This may be a fine choice on
an 8 cpu machine, but would be very bad on a 4 cpu machine. Instead, use
something like
setenv MP_SET_NUMTHREADS "max(1,all-1)"
which sets the number of threads to be one less than the number of cpus
on the current machine (but always at least one). If your configuration
includes some machines with large numbers of cpus, setting an upper bound
is a good idea. Something like:
setenv MP_SET_NUMTHREADS "min(all,4)"
will request (no more than) 4 cpus.
For compatibility with earlier releases, NUM_THREADS is supported as a
synonym for MP_SET_NUMTHREADS.
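A sketch of the programmatic equivalent (the value 4 is arbitrary):

     mp_set_numthreads(4);    /* default is now 4 threads             */
     mp_setup();              /* create the 3 slave threads now,      */
                              /* rather than at the first parallel    */
                              /* region                               */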
mp_my_threadnum returns an integer between 0 and n-1 where n is the value
returned by mp_numthreads. The master process is always thread 0. This
is occasionally useful for optimizing certain kinds of loops by hand.
mp_is_master returns 1 if called by the master process, 0 otherwise.
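A typical use is in code that already runs in parallel, to divide a
piece of work by hand. In the sketch below, a, b, and n are assumed to
be shared variables visible to every thread:

     int me   = mp_my_threadnum();    /* 0 .. nthr-1                  */
     int nthr = mp_numthreads();
     int i;

     for (i = me; i < n; i += nthr)   /* interleave the iterations    */
         a[i] = b[i] * 2.0;

     if (mp_is_master())              /* only thread 0 reports        */
         printf("work divided among %d threads\n", nthr);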
mp_setlock provides convenient (though limited) access to the locking
routines. The convenience is that no set up need be done; it may be
called directly without any preliminaries. The limitation is that there
is only one lock. It is analogous to the ussetlock(3P) routine, but it
takes no arguments and does not return a value. This is useful for
serializing access to shared variables (e.g. counters) in a parallel
region. Note that it will frequently be necessary to declare those
variables as volatile to ensure that the optimizer does not assign them
to a register.
mp_unsetlock is the companion routine for mp_setlock. It also takes no
arguments and does not return a value.
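For example, a shared counter updated inside a parallel region can be
protected like this (a sketch; note the volatile qualifier):

     volatile int hits = 0;   /* shared counter                       */
     ...
     mp_setlock();            /* only one thread at a time in here    */
     hits++;
     mp_unsetlock();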
mp_barrier provides a simple interface to a single barrier(3P). It may
be used inside a parallel loop to force a barrier synchronization to
occur among the parallel threads. The routine takes no arguments,
returns no value, and does not require any initialization.
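For example, inside a parallel region each thread might fill part of a
shared array and then need to read parts written by other threads.
The sketch below reuses me, nthr, i, a, b, and n from the example above:

     for (i = me; i < n; i += nthr)   /* phase one: write a[]         */
         a[i] = i;

     mp_barrier();        /* no thread continues until every thread   */
                          /* has finished phase one                   */

     for (i = me; i < n; i += nthr)   /* phase two: may now read any  */
         b[i] = a[(i + 1) % n];       /* element of a[]               */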
mp_in_doacross_loop answers the question "am I currently executing inside
a parallel loop?" This is needed in certain rare situations where you
have an external routine that can be called both from inside a parallel
loop and also from outside a parallel loop, and the routine must do
different things depending on whether it is being called in parallel or
not.
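A sketch of such a routine; record_event is a hypothetical helper that
updates shared state:

     void
     log_event(int id)
     {
         if (mp_in_doacross_loop()) {
             mp_setlock();            /* called in parallel: protect  */
             record_event(id);        /* the shared state             */
             mp_unsetlock();
         } else {
             record_event(id);        /* serial caller: no locking    */
         }
     }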
mp_set_slave_stacksize sets the stacksize (in bytes) to be used by the
slave processes when they are created (via sprocsp(2)). The default size
is 16MB. Note that slave processes only allocate their local data onto
their stack, shared data (even if allocated on the master's stack) is not
counted.
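For example, a program whose slaves need little stack might request
1 MB instead of the default before the slaves are created (the size is
illustrative):

     mp_set_slave_stacksize(1024*1024);  /* 1 MB per slave stack      */
     mp_setup();                         /* slaves are created with   */
                                         /* the new size              */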
mp_suggested_numthreads uses the supplied value as a hint about how many
threads to use in subsequent parallel regions, and returns the previous
value of the number of threads to be employed in parallel regions. It
does not affect currently executing parallel regions, if any. The
implementation may ignore this hint depending on factors such as overall
system load. This routine may also be called with the value 0, in which
case it simply returns the number of threads to be employed in parallel
regions without the side-effect present in mp_numthreads.
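For example (a sketch):

     unsigned int prev, cur;

     prev = mp_suggested_numthreads(2);  /* hint: use 2 threads       */
     cur  = mp_suggested_numthreads(0);  /* query only: no new hint,  */
                                         /* no freezing side-effect   */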
Pragmas or directives
The MIPSpro C (and C++) compiler allows you to apply the capabilities of
a Silicon Graphics multiprocessor computer to the execution of a single
job. By coding a few simple directives, the compiler splits the job into
concurrently executing pieces, thereby decreasing the wall-clock run time
of the job.
Directives enable, disable, or modify a feature of the compiler.
Essentially, directives are command line options specified within the
input file instead of on the command line. Unlike command line options,
directives have no default setting. To invoke a directive, you must
either toggle it on or set a desired value for its level. The following
directives can be used in C (and C++) programs when compiled with the -mp
option.
#pragma parallel
This pragma denotes the start of a parallel region. The syntax for
this pragma has a number of modifiers, but to run a single loop in
parallel, the only modifiers you usually use are shared and local.
These options tell the multiprocessing compiler which variables to
share between all threads of execution and which variables should be
treated as local.
In C, the code that comprises the parallel region is delimited by
curly braces ({ }) and immediately follows the parallel pragma and
its modifiers.
The syntax for this pragma is:
#pragma parallel shared (variables)
#pragma local (variables) optional modifiers
{code}
The parallel pragma has four modifiers: shared, local, if, and
numthreads.
Their definitions are:
shared ( variable_names )
Tells the multiprocessing C compiler the names of all the
variables that the threads must share.
local ( variable_names )
Tells the multiprocessing C compiler the names of all the
variables that must be private to each thread. (When PCA sets up
a parallel region, it does this for you.)
if ( integer_valued_expr )
Lets you set up a condition that is evaluated at run time to
determine whether or not to run the statement(s) serially or in
parallel. At compile time, it is not always possible to judge how
much work a parallel region does (for example, loop indices are
often calculated from data supplied at run time). Avoid running
trivial amounts of code in parallel because you cannot make up
the overhead associated with running code in parallel. PCA will
also generate this condition as appropriate. If the if condition
is false (equal to zero), then the statement(s) runs serially.
Otherwise, the statement(s) run in parallel.
numthreads(expr)
Tells the multiprocessing C compiler the number of available
threads to use when running this region in parallel. (The default
is all the available threads.)
In general, you should never have more threads of execution than
you have processors, and you should specify numthreads with the
MP_SET_NUMTHREADS environment variable at run time. If you want
to run a loop in parallel while you run some other code, you can
use this option to tell the multiprocessing C compiler to use
only some of the available threads.
The expression expr should evaluate to a positive integer.
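As an illustration, the following region (using the pfor construct
described below) runs in parallel only when the amount of work
justifies the overhead, and then uses at most two threads; the
threshold 1000 is an arbitrary choice:

     #pragma parallel shared(a, b, n) local(i) if(n > 1000) numthreads(2)
     {
     #pragma pfor iterate(i = 0; n; 1)
         for (i = 0; i < n; i++)
             a[i] = b[i] * 2.0;
     }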
For example, to start a parallel region in which to run the
following code in parallel:
for (idx=n; idx; idx--) {
    a[idx] = b[idx] + c[idx];
}
you must write:
#pragma parallel shared( a, b, c ) shared(n) local( idx )
or:
#pragma parallel
#pragma shared( a, b, c )
#pragma shared(n)
#pragma local(idx)
before the statement or compound statement (code in curly braces,
{ }) that comprises the parallel region.
Any code within a parallel region but not within any of the
explicit parallel constructs ( pfor, independent, one processor,
and critical ) is termed local code. Local code typically
modifies only local data and is run by all threads.
#pragma pfor
The pfor construct is contained within a parallel region. Use #pragma pfor to
run a for loop in parallel only if the loop meets all of these
conditions:
All the values of the index variable can be computed
independently of the iterations.
All iterations are independent of each other - that is, data used
in one iteration does not depend on data created by another
iteration. A quick test for independence: if the loop can be run
backwards, then chances are good the iterations are independent.
The loop control variable cannot be a field within a
class/struct/union or an array element.
The number of times the loop must be executed is determined once,
upon entry to the loop, and is based on the loop initialization,
loop test, and loop increment statements.
If the number of times the loop is actually executed is different
from what is computed above, the results are unpredictable. This
can happen if the loop test and increment change during the
execution of the loop, or if there is an early exit from within
the for loop. An early exit or a change to the loop test and
increment during execution may have serious performance
implications.
The test or the increment should not contain expressions with
side effects.
The chunksize, if specified, is computed before the loop is
executed, and the behavior is unpredictable if its value changes
within the loop.
If you are writing a pfor loop for the multiprocessing C++
compiler, the index variable i can be declared within the for
statement via
int i = 0;
The draft for the C++ standard states that the scope of the index
variable declared in a for statement extends to the end of the
for statement, as in this example:
#pragma pfor for (int i = 0, ...)
The C++ compiler doesn't enforce this; in fact, with this
compiler the scope extends to the end of the enclosing block. Use
care when writing code so that the subsequent change in scope
rules for i (in later compiler releases) do not affect the user
code.
If the code after a pfor is not dependent on the calculations made in
the pfor loop, there is no reason to synchronize the threads of
execution before they continue. So, if one thread from the pfor
finishes early, it can go on to execute the serial code without
waiting for the other threads to finish their part of the loop.
The #pragma pfor directive takes several modifiers; the only one that
is required is iterate. #pragma pfor tells the compiler that each
iteration of the loop is unique. It also partitions the iterations
among the threads for execution.
The syntax for #pragma pfor is:
#pragma pfor iterate ( ) optional_modifiers
for ...
{ code ... }
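For example, the loop shown earlier can be partitioned among the
threads like this (a sketch; the iterate values follow the
descriptions below):

     #pragma parallel shared(a, b, c, n) local(idx)
     {
     #pragma pfor iterate(idx = n; n; -1)
         for (idx = n; idx; idx--)
             a[idx] = b[idx] + c[idx];
     }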
The pfor pragma has several modifiers. Their syntax is:
iterate (index variable=expr1; expr2; expr3 )
local(variable list)
lastlocal (variable list)
reduction (variable list)
affinity (variable) = thread (expression)
schedtype (type)
chunksize (expr)
Where:
iterate (index variable=expr1; expr2; expr3 )
Gives the multiprocessing C compiler the information it needs to
identify the unique iterations of the loop and partition them to
particular threads of execution.
index variable is the index variable of the for loop you want
to run in parallel.
expr1 is the starting value for the loop index.
expr2 is the number of iterations for the loop you want to
run in parallel.
expr3 is the increment of the for loop you want to run in
parallel.
local (variable list)
Specifies variables that are local to each process. If a variable
is declared as local, each iteration of the loop is given its own
uninitialized copy of the variable. You can declare a variable as
local if its value does not depend on any other iteration of the
loop and if its value is used only within a single iteration. In
effect the local variable is just temporary; a new copy can be
created in each loop iteration without changing the final answer.
lastlocal (variable list)
Specifies variables that are local to each process. Unlike with
the local clause, the compiler saves only the value of the
logically last iteration of the loop when it exits.
reduction (variable list)
Specifies variables involved in a reduction operation. In a
reduction operation, the compiler keeps local copies of the
variables and combines them when it exits the loop. An element of
the reduction list must be an individual variable (also called a
scalar variable) and cannot be an array or struct. However, it
can be an individual element of an array. When the reduction
modifier is used, it appears in the list with the correct
subscripts.
One element of an array can be used in a reduction operation,
while other elements of the array are used in other ways. To
allow for this, if an element of an array appears in the
reduction list, the entire array can also appear in the shared
list.
The two types of reductions supported are sum(+) and product(*).
The compiler confirms that the reduction expression is legal by
making some simple checks. The compiler does not, however, check
all statements in the do loop for illegal reductions. You must
ensure that the reduction variable is used correctly in a
reduction operation.
affinity (variable) = thread (expression)
The effect of thread-affinity is to execute iteration "i" on the
thread number given by the user-supplied expression (modulo the
number of threads). Since the threads may need to evaluate this
expression in each iteration of the loop, the variables used in
the expression (other than the loop induction variable) must be
declared shared and must not be modified during the execution of
the loop. Violating these rules may lead to incorrect results.
If the expression does not depend on the loop induction variable,
then all iterations will execute on the same thread, and will not
benefit from parallel execution.
schedtype (type)
Tells the multiprocessing C compiler how to share the loop
iterations among the processors. The schedtype chosen depends on
the type of system you are using and the number of programs
executing. You can use the following valid types to modify
schedtype:
simple (the default)
Tells the run time scheduler to partition the iterations
evenly among all the available threads.
runtime
Tells the compiler that the real schedule type will be
specified at run time.
dynamic
Tells the run time scheduler to give each thread chunksize
iterations of the loop. chunksize should be smaller than
(number of total iterations)/(number of threads). The
advantage of dynamic over simple is that dynamic helps
distribute the work more evenly than simple.
Depending on the data, some iterations of a loop can take
longer to compute than others, so some threads may finish
long before the others. In this situation, if the iterations
are distributed by simple, then the thread that finished early
waits for the others. But if the iterations are distributed by
dynamic, the thread doesn't wait; it goes back to get another
chunksize of iterations, until all the iterations of the loop
have been run.
interleave
Tells the run time scheduler to give each thread chunksize
iterations (described below) of the loop, which are then
assigned to the threads in an interleaved way.
gss (guided self-scheduling)
Tells the run time scheduler to give each processor a varied
number of iterations of the loop. This is like dynamic, but
instead of a fixed chunk size, the pieces begin large and become
progressively smaller.
If I iterations remain and P threads are working on them, the
piece size is roughly: I/(2P) + 1
Programs with triangular matrices should use gss.
chunksize (expr)
Tells the multiprocessing C/C++ compiler how many iterations
to define as a chunk when you use the dynamic or interleave
modifier (described above).
expr should be a positive integer, and should evaluate to the
following formula:
number of iterations / X
where X is between twice and ten times the number of threads.
Select X equal to twice the number of threads when the iterations
vary only slightly in cost; increase X (that is, reduce the chunk
size) as the variance among the iterations increases. Performance
gains may diminish once X exceeds ten times the number of threads.
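As an illustration, the following loop sums an array, using a
reduction and distributing the iterations dynamically in chunks of
100 (the values are illustrative; the reduction variable sum is
also listed as shared on the enclosing parallel pragma):

     sum = 0.0;
     #pragma parallel shared(a, n, sum) local(i)
     {
     #pragma pfor iterate(i = 0; n; 1) reduction(sum) schedtype(dynamic) chunksize(100)
         for (i = 0; i < n; i++)
             sum += a[i];
     }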
#pragma one processor
A #pragma one processor directive causes the statement that follows
it to be executed by exactly one thread.
The syntax of this pragma is:
#pragma one processor
{ code }
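For example, to have exactly one thread print a progress message
inside a parallel region:

     #pragma one processor
     {
         printf("entering the parallel phase\n");
     }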
#pragma critical
Sometimes the bulk of the work done by a loop can be done in
parallel, but the entire loop cannot run in parallel because of a
single data-dependent statement. Often, you can move such a statement
out of the parallel region. When that is not possible, you can
sometimes use a lock on the statement to preserve the integrity of
the data.
In the multiprocessing C/C++ compiler, use the critical pragma to put
a lock on a critical statement (or compound statement using { }).
When you put a lock on a statement, only one thread at a time can
execute that statement. If one thread is already working on a
critical protected statement, any other thread that wants to execute
that statement must wait until that thread has finished executing it.
The syntax of the critical pragma is:
#pragma critical (lock_variable)
{ code }
The statement(s) after the critical pragma will be executed by all
threads, one at a time. The lock variable lock_variable is an
optional integer variable that must be initialized to zero. The
parentheses are required. If you don't specify a lock variable, the
compiler automatically supplies one. Multiple critical constructs
inside the same parallel region are considered to be independent of
each other unless they use the same explicit lock variable.
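For example, a shared counter can be updated safely inside a parallel
loop (a sketch; lockvar is an optional, explicitly supplied lock
variable):

     volatile int nfound = 0;     /* shared counter                   */
     int lockvar = 0;             /* lock variable, initialized to 0  */
     ...
     #pragma critical (lockvar)
     {
         nfound++;                /* only one thread at a time here   */
     }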
#pragma independent
Running a loop in parallel is a class of parallelism sometimes called
fine-grained parallelism or homogeneous parallelism. It is called
homogeneous because all the threads execute the same code on
different data. Another class of parallelism is called coarse-
grained parallelism or heterogeneous parallelism. As the name
suggests, the code in each thread of execution is different.
Ensuring data independence for heterogeneous code executed in
parallel is not always as easy as it is for homogeneous code executed
in parallel. (Ensuring data independence for homogeneous code is not
a trivial task.)
The independent pragma has no modifiers. Use this pragma to tell the
multiprocessing C/C++ compiler to run code in parallel with the rest
of the code in the parallel region.
The syntax for #pragma independent is:
#pragma independent
{ code }
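For example, one thread can read input while the remaining threads
execute the rest of the region (a sketch; read_input and compute are
hypothetical routines):

     #pragma parallel shared(buf, a, n) local(i)
     {
     #pragma independent
         {
             read_input(buf);     /* runs on a single thread, in      */
                                  /* parallel with the code below     */
         }

         compute(a, n);           /* local code: executed by all      */
                                  /* threads                          */
     }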
Synchronization Directives
To account for data dependencies, it is sometimes necessary for threads
to wait for all other threads to complete executing an earlier section of
code. Two sets of directives implement this coordination: #pragma
synchronize and #pragma enter/exit gate.
#pragma synchronize
A #pragma synchronize tells the multiprocessing C/C++ compiler that
within a parallel region, no thread can execute the statements that
follow this pragma until all threads have reached it. This
directive is a classic barrier construct.
The syntax for this pragma is:
#pragma synchronize
#pragma enter gate
#pragma exit gate
You can use two additional pragmas to coordinate the processing of
code within a parallel region. These additional pragmas work as a
matched set. They are #pragma enter gate and #pragma exit gate.
A gate is a special barrier. No thread can exit the gate until all
threads have entered it. This construct gives you more flexibility
when managing dependencies between the work-sharing constructs
within a parallel region.
The syntax of the enter gate pragma is:
#pragma enter gate
For example, construct D may be dependent on construct A, and
construct F may be dependent on construct B. However, you do not
want construct D to wait simply because all the threads have not yet
cleared construct B. By using enter/exit gate pairs, you can make
subtle distinctions about which construct is dependent on which
other construct.
Put this pragma after the work-sharing construct that all threads
must clear before the matching #pragma exit gate.
The syntax of the exit gate pragma is:
#pragma exit gate
Put this pragma before the work-sharing construct that is dependent
on the preceding #pragma enter gate. No thread enters this work-
sharing construct until all threads have cleared the work-sharing
construct controlled by the corresponding #pragma enter gate.
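A sketch of constructs A, B, and D from the example above, represented
by pfor loops; the arrays and loop bounds are illustrative:

     #pragma parallel shared(a, b, c, n) local(i)
     {
     #pragma pfor iterate(i = 0; n; 1)         /* construct A         */
         for (i = 0; i < n; i++)
             a[i] = i;

     #pragma enter gate        /* placed after A                      */

     #pragma pfor iterate(i = 0; n; 1)         /* construct B         */
         for (i = 0; i < n; i++)
             b[i] = 2 * i;

     #pragma exit gate         /* D starts only after every thread    */
                               /* has cleared A, but need not wait    */
                               /* for every thread to finish B        */

     #pragma pfor iterate(i = 0; n; 1)         /* construct D:        */
         for (i = 0; i < n; i++)               /* depends only on A   */
             c[i] = a[n - 1 - i];
     }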
#pragma page_place
The syntax of this pragma is:
#pragma page_place (addr, size, threadnum)
where addr is the starting address, size is the size in bytes, and
threadnum is the thread.
On a system with physically distributed shared memory (for example,
Origin2000), you can explicitly place all data pages spanned by the
virtual address range [addr, addr + size-1] in the physical memory
of the processor corresponding to the specified thread.
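For example, to place all pages of a hypothetical array a in the
memory of the processor running thread 3 (the size and thread number
are illustrative):

     double a[100000];

     #pragma page_place (a, 100000 * sizeof(double), 3)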
SEE ALSO
cc(1), f77(1), mp(3f), sync(3c), sync(3f), MIPSpro Power C Programmer's
Guide, MIPSpro C Language Reference Manual, MIPSpro FORTRAN 77
Programmer's Guide