DPSLDU(3S)DPSLDU(3S)NAME
DPSLDU_Destroy, DPSLDU_ExtractPerm, DPSLDU_Factor, DPSLDU_FactorOOC,
DPSLDU_OOCLimit, DPSLDU_OOCPath, DPSLDU_Ordering, DPSLDU_Preprocess,
DPSLDU_PreprocessZ, DPSLDU_Solve, DPSLDU_SolveM, DPSLDU_Storage-
Parallel sparse unsymmetric solver for linear systems of real equations
SYNOPSIS
Fortran synopsis:
SUBROUTINE DPSLDU_DESTROY (token)
INTEGER token
SUBROUTINE DPSLDU_EXTRACTPERM (token, perm)
INTEGER token, perm(*)
SUBROUTINE DPSLDU_FACTOR (token, n, pointers, indices, values)
INTEGER token, n, pointers(*), indices(*)
DOUBLE PRECISION values(*)
SUBROUTINE DPSLDU_FACTOROOC (token, n, pointers, indices, values)
INTEGER token, n, pointers(*), indices(*)
DOUBLE PRECISION values(*)
SUBROUTINE DPSLDU_OOCLIMIT (token, ooclimit)
INTEGER token
DOUBLE PRECISION ooclimit
SUBROUTINE DPSLDU_OOCPATH (token, oocpath)
INTEGER token
CHARACTER oocpath(*)
SUBROUTINE DPSLDU_ORDERING (token, method)
INTEGER token, method
SUBROUTINE DPSLDU_PREPROCESS (token, n, pointers, indices,
non_zeros, ops)
INTEGER token, n, pointers(*), indices(*)
INTEGER*8 non_zeros
DOUBLE PRECISION ops
SUBROUTINE DPSLDU_PREPROCESSZ (token, n, pointers, indices, mask,
non_zeros, ops)
INTEGER token, n, pointers(*), indices(*), mask(*)
INTEGER*8 non_zeros
DOUBLE PRECISION ops
SUBROUTINE DPSLDU_SOLVE (token, x, b)
INTEGER token
DOUBLE PRECISION x(*), b(*)
Page 1
DPSLDU(3S)DPSLDU(3S)
SUBROUTINE DPSLDU_SOLVEM (token, X, ldx, B, ldb, nrhs)
INTEGER token, ldx, ldb, nrhs
DOUBLE PRECISION X(*), B(*)
DOUBLE PRECISION FUNCTION DPSLDU_STORAGE(token)
INTEGER token
C/C++ synopsis:
#include <scsl_sparse.h>
void DPSLDU_Destroy (int token);
void DPSLDU_ExtractPerm (int token, int perm[]);
void DPSLDU_Factor (int token, int n, int pointers[], int indices[],
double values[]);
void DPSLDU_FactorOOC (int token, int n, int pointers[], int
indices[], double values[]);
void DPSLDU_OOCLimit (int token, double ooclimit);
void DPSLDU_OOCPath (int token, char oocpath[]);
void DPSLDU_Ordering (int token, int method);
void DPSLDU_Preprocess (int token, int n, int pointers[], int
indices[], long long *non_zeros, double *ops);
void DPSLDU_PreprocessZ (int token, int n, int pointers[], int
indices[], int mask[], long long *non_zeros, double *ops);
void DPSLDU_Solve (int token, double x[], double b[]);
void DPSLDU_SolveM (int token, double X[], int ldx, double B[], int
ldb, int nrhs);
double DPSLDU_Storage (int token);
IMPLEMENTATION
These routines are part of the SCSL Scientific Library and can be loaded
using either the -lscs or the -lscs_mp option. The -lscs_mp option
directs the linker to use the multi-processor version of the library.
When linking to SCSL with -lscs or -lscs_mp, the default integer size is
4 bytes (32 bits). Another version of SCSL is available in which integers
are 8 bytes (64 bits). This version allows the user access to larger
memory sizes and helps when porting legacy Cray codes. It can be loaded
by using the -lscs_i8 option or the -lscs_i8_mp option. A program may
use only one of the two versions; 4-byte integer and 8-byte integer
library calls cannot be mixed.
Page 2
DPSLDU(3S)DPSLDU(3S)
The C and C++ prototypes shown above are appropriate for the 4-byte
integer version of SCSL. When using the 8-byte integer version, the
variables of type int become long long and the <scsl_sparse_i8.h> header
file should be included.
DESCRIPTION
DPSLDU solves sparse unsymmetric linear systems of the form Ax = b where
A is a real n-by-n input matrix having symmetric non-zero pattern but
unsymmetric non-zero values, b is a real input vector of length n, and x
is an unknown real vector of length n.
DPSLDU uses a direct method. A is factored into the following form:
A = L D U
where L is a lower triangular matrix with unit diagonal, D is a diagonal
matrix, and U is an upper triangular matrix with unit diagonal.
Note that NO PIVOTING FOR STABILITY is performed during factorization.
The DPSLDU library contains five main routines.
* DPSLDU_Ordering() allows the user to select one of five possible
reordering methods to be used in the matrix preprocessing phase.
* DPSLDU_Preprocess() performs preprocessing operations on the
structure of A (heuristic reordering to reduce fill in L and U,
symbolic factorization, etc.).
* DPSLDU_Factor() factors the matrix A into L and U, using the
previously computed preprocessing data.
* DPSLDU_Solve() solves for a vector x, given an input vector b.
* DPSLDU_Destroy() frees all storage associated with the matrix A
(including L, D, U, and various data structures computed during
preprocessing).
The user can call DPSLDU_Factor() several times after a single call to
DPSLDU_Preprocess() to factor multiple matrices with identical non-zero
structures but different values. Similarly, the user can call
DPSLDU_Solve() several times after a single call to DPSLDU_Factor() to
solve for multiple right-hand-sides. Also, the user can call
DPSLDU_SolveM() to solve for multiple right-hand-sides all stored in a
single array.
Sparse Matrix Format
Sparse matrix A must be input to DPSLDU in Harwell-Boeing format (also
known as Compressed Column Storage format).
Page 3
DPSLDU(3S)DPSLDU(3S)
The matrix is held in three arrays: pointers, indices, and values. The
indices array contains the row indices of the non-zeros in A. The values
array holds the corresponding non-zero values. The pointers array
contains the index in indices for the first non-zero in each column of A.
Thus, the row indices for the non-zeros in column i can be found in
locations indices[pointers[i]] through indices[pointers[i+1]-1]. The
corresponding values can be found in location values[pointers[i]] through
values[pointers[i+1]-1].
DPSLDU imposes one constraint on the representation of the A matrix. The
non-zeros within each column must appear in order of increasing row
number.
In the following example, the unsymmetric matrix
1.0 0.0 5.0 0.0
0.0 3.0 0.0 8.0
2.0 0.0 7.0 0.0
0.0 4.0 0.0 9.0
would be represented in FORTRAN as follows:
INTEGER pointers(5), indices(8), i
DOUBLE PRECISION values(8)
DATA (pointers(i), i = 1, 5) / 1, 3, 5, 7, 9 /
DATA (indices(i), i = 1, 8) / 1, 3, 2, 4, 1, 3, 2, 4 /
DATA (values(i), i = 1, 8) / 1.0, 2.0, 3.0, 4.0, 5.0,
& 7.0, 8.0, 9.0 /
Zero-based indexing is used in C, so the pointers, indices, and values
arrays would contain the following:
int pointers[] = {0, 2, 4, 6, 8};
int indices[] = {0, 2, 1, 3, 0, 2, 1, 3};
double values[] = {1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 8.0, 9.0};
Ordering Methods
The DPSLDU_Ordering(token, method) routine allows the user to change the
ordering method used to pre-order the matrix before factorization. This
routine must be called before calling DPSLDU_Preprocess(). Five options
are currently available for the method parameter:
* Method 0 performs no pre-ordering
* Method 1 performs Approximate Minimum Fill ordering
* Method 2 performs a single nested dissection ordering (default).
This method is often called "Extreme matrix ordering".
Page 4
DPSLDU(3S)DPSLDU(3S)
* Method 3 performs multiple nested dissection orderings (in parallel)
* Method 4 performs multiple nested dissection (the same as in Method
3), but it uses a feedback file to "learn" from the previous solves
of the same matrix structure and it performs more orderings. The
multiple nested dissection technique of Methods 3 and 4 is also
referred to as "Extreme2 matrix ordering".
Method 2 is significantly more expensive than Method 1, but it usually
produces significantly better orderings. Method 3 is especially
effective on multi-processor systems. It computes OMP_NUM_THREADS (where
OMP_NUM_THREADS is an environment variable indicating the number of
processors to be used for parallel computation) matrix orderings using
different starting points for the algorithm and uses the ordering that
will lead to the fewest floating-point operations to factorize the
matrix.
Method 4 is useful only when the same non-zero structure is used for
multiple solves. Method 4 keeps a record in a "feedback" file of a
signature for non-zero structures for a maximum of 200 matrices and of
the starting point that was saved from a previous solve for that
structure. In the next Method 4 ordering for that non-zero structure,
that best starting point and 2 * OMP_NUM_THREADS - 1 new ones generate
orderings. The best ordering is used. In this way, the quality of
orderings stay the same or improve over time.
Methods 3 and 4 typically take more time for the matrix preprocessing
than the default. However, on large systems or on repeated
factorizations, significant overall speedups (1.1X to 2X) can be obtained
compared to Method 2.
Extracting the permutation vector
Unless ordering Method 0 is used, DPSLDU applies a symmetric permutation
to matrix A before the factorization step; the resulting permuted matrix
generally has significantly less fill-in than the original matrix. The
user can obtain the permutation matrix associated with a given token by
calling DPSLDU_ExtractPerm(token, perm). The permutation is returned as
an integer array of length n, with 1 <= perm(i) <= n (0 <= perm[i] < n
for C code).
A value of k for perm(i) implies that node k in the original ordering is
node i in the new ordering.
Matrices with zeros on the diagonal
As noted above, no pivoting is done for stability during factorization;
when zero or near-zero pivots are encountered, DPSLDU usually fails. In
these cases, it may be possible to use DPSLDU_PreprocessZ() to obtain a
slightly different, but stable, ordering. The user provides an
additional integer array, mask, as an argument to DPSLDU_PreprocessZ().
If mask(i)=0, then DPSLDU will attempt to maximize the diagonal element
|Aii|.
Page 5
DPSLDU(3S)DPSLDU(3S)
Memory usage
The returned value of DPSLDU_Storage() is an estimate of the amount of
storage required (in millions of bytes) by the solver's data structures
for a given matrix system.
Out-of-core Factorization
The storage associated with the factor can be managed in two ways. The
DPSLDU_Factor() routine allocates memory for the factor and manages it
internally, releasing it only when DPSLDU_Destroy() is called. The
alternative is to do out-of-core factorization by calling
DPSLDU_FactorOOC(). This routine uses a small amount of in-core memory,
placing the remainder of the factor matrix on disk as it is computed.
The user can call DPSLDU_OOCPath() to indicate the directory in which the
factor file should be written, and DPSLDU_OOCLimit() to indicate how much
memory to use to hold portions of the factor matrix in-core. More in-
core memory generally leads to less disk I/O and higher performance
during the factorization. The only required change is to move from in-
core factorization to out-of-core factorization is the change from
DPSLDU_Factor() to DPSLDU_FactorOOC(). The other routines
(DPSLDU_Solve(), DPSLDU_Destroy(), etc.) handle out-of-core factors
transparently. Note that DPSLDU_FactorOOC() and subsequent calls to
DPSLDU_Solve() are not parallelized (but calls to DPSLDU_SolveM() are
parallelized, as discussed below).
Multiple Right-Hand-Sides
DPSLDU can solve for large numbers of right-hand-sides with one call to
DPSLDU_SolveM(). It solves these right hand sides in parallel, with each
processor solving up to four at a time for in-core systems and up to
PSLDU_OOCBLK at a time for out-of-core systems, where PSLDU_OOCBLK is an
environment variable whose default value is 1.
In-place Solves
Both DPSLDU_Solve() and DPSLDU_SolveM() allow the solution vector(s) to
overwrite the right-hand-side(s) when identical vectors or matrices are
supplied to these routines. For example,
CALL DPSLDU_SOLVE(token, b, b)
takes the right-hand-side input from b and also returns the solution
vector in b. When this option is used with DPSLDU_SolveM(), the leading
dimensions for the solution and right-hand-side matrices must agree. The
amount of memory actually saved by performing an in-place solve depends
on the number of right-hand-sides used. For a single right-hand-side,
there are no net savings versus an out-of-place solve because a temporary
copy of the input vector is made internally. For multiple right-hand-
sides the memory overhead decreases as the ratio of right-hand-sides to
processors used increases.
Arguments
These routines have the following arguments:
Page 6
DPSLDU(3S)DPSLDU(3S)
token (input) DPSLDU can handle multiple matrices simultaneously. The
token distinguishes between active matrices. The token passed
to DPSLDU_Factor() must match the token used in some previous
call to DPSLDU_Preprocess(). Similarly, the token passed to
DPSLDU_Solve() must match the token used in some previous call
to DPSLDU_Factor(). 0 <= token <= 19.
method (input) An integer specifying the ordering method used during
preprocessing. 0 <= method <= 4.
n (input) The number of rows and columns in the matrix A. n >=
0.
pointers, indices, values
(input) The pointers and indices arrays store the non-zero
structure of sparse input matrix A in Harwell-Boeing or
Compressed Sparse Column (CSC) format.
The pointers array stores n+1 integers, where pointers[i] gives
the index in indices of the first non-zero in column i of A.
The indices array stores the row indices of the non-zeros in A.
The values array stores the non-zero values in the matrix A.
non_zeros (output) The number of non-zero values in L and U.
ops (output) The number of floating-point operations required to
factor A.
mask (input) An integer array of length n used in
DPSLDU_PreprocessZ(). If mask(i) = 0, then node i of matrix A
is ordered after all of its neighbors in an attempt to avoid a
zero pivot.
b (input) The right-hand-side vector in a DPSLDU_Solve() call.
x (output) The solution vector in a DPSLDU_Solve() call.
nrhs (input) The number of right-hand side vectors present in a
DPSLDU_SolveM() call.
B (input) The right-hand-side matrix in a DPSLDU_SolveM() call.
Must be stored in column-major order.
ldb (input) The leading dimension of matrix B. ldb >= n.
X (output) The solution matrix in a DPSLDU_SolveM() call. Must be
stored in column-major order.
ldx (input) The leading dimension of matrix X. ldx >= n.
Page 7
DPSLDU(3S)DPSLDU(3S)
oocpath (input) A character array/string with a path to the directory
where the temporary out-of-core factor files should be stored.
If this path is on a striped (or raid-0) file system, the
performance of the out-of-core solves can be considerably
improved. The default path is /usr/tmp.
ooclimit (input) A double precision number indicating the number of
Mbytes of random access memory that should be used for factor
storage during a call to DPSLDU_FactorOOC(). Note that there
are many other arrays used besides those directly used to store
the factorization, so total RAM usage by the solve will exceed
this number. The default is 64 MB.
perm (output) An integer array of length n containing the
permutation used to reorder matrix A.
ENVIRONMENT VARIABLES
Two environment variables can affect the operation of ordering methods 3
and 4. SPARSE_NUM_ORDERS can be used to change the number of orderings
performed from the default of OMP_NUM_THREADS for Method 3 and
(2*OMP_NUM_THREADS) for Method 4. SPARSE_FEEDBACK_FILE can be set to the
path and file name where the feedback information will be kept;
otherwise, the default feedback file is $HOME/.sparseFeedback. This file
will be less than 5K bytes.
The environment variable OMP_NUM_THREADS determines the number of
processors that are used for the numerical factorization and solve
phases. Out-of-core solves can be performed in groups of PSLDU_OOCBLK
right-hand-sides per processor. Setting the environment variable
PSLDU_VERBOSE causes DPSLDU to output information about the
factorization.
NOTES
These routines are optimized and parallelized for the SGI R8000 and
R1x000 platforms.
SEE ALSOINTRO_SCSL(3S), INTRO_SOLVERS(3S), DPSLDLT(3S), ZPSLDLT(3S), ZPSLDU(3S)
Page 8