Getting Started

This chapter provides a quick introduction to SLATE through a complete example program that solves a linear system \(AX = B\) using LU factorization.

Example Program: LU Solve

The following example demonstrates how to set up SLATE matrices and solve a linear system using the distributed lu_solve implementation (also known as gesv in traditional LAPACK naming).

#include <slate/slate.hh>
#include <blas.hh>
#include <mpi.h>
#include <cstdio>
#include <stdexcept>  // for std::runtime_error

int main(int argc, char** argv)
{
    // Initialize MPI with thread support
    int mpi_provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_provided);
    if (mpi_provided != MPI_THREAD_MULTIPLE) {
        throw std::runtime_error("MPI_THREAD_MULTIPLE required");
    }

    int mpi_rank, mpi_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    // Problem parameters
    int64_t n = 5000;      // Matrix dimension
    int64_t nrhs = 1;      // Number of right-hand sides
    int64_t nb = 256;      // Tile size
    int p = 2, q = 2;      // Process grid dimensions

    // Verify we have enough MPI processes
    if (mpi_size < p * q) {
        if (mpi_rank == 0) {
            printf("Need at least %d MPI processes\n", p * q);
        }
        MPI_Finalize();
        return 1;
    }

    // Create SLATE matrices
    slate::Matrix<double> A(n, n, nb, p, q, MPI_COMM_WORLD);
    slate::Matrix<double> B(n, nrhs, nb, p, q, MPI_COMM_WORLD);

    // Allocate local tiles on each process
    A.insertLocalTiles();
    B.insertLocalTiles();

    // Initialize with random data (each rank different seed)
    srand(100 * mpi_rank);
    // ... initialize A and B tiles ...

    // Solve AX = B using LU factorization
    // Solution X overwrites B
    slate::lu_solve(A, B);

    MPI_Finalize();
    return 0;
}

Understanding the Example

MPI Initialization

SLATE requires MPI to be initialized with MPI_THREAD_MULTIPLE support, as SLATE uses OpenMP threads internally that may make MPI calls:

MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_provided);

Creating Matrices

SLATE matrices are defined with:

  • Dimensions: n × n for matrix A, n × nrhs for matrix B

  • Tile size: nb × nb blocks (256 is a reasonable default)

  • Process grid: p × q grid of MPI processes

  • Communicator: MPI communicator for collective operations

slate::Matrix<double> A(n, n, nb, p, q, MPI_COMM_WORLD);

The matrices are distributed in a 2D block-cyclic pattern across the process grid, similar to ScaLAPACK.

Allocating Tiles

After creating the matrix structure, local tiles must be allocated:

A.insertLocalTiles();

This allocates memory for tiles that belong to the current process based on the 2D block-cyclic distribution. SLATE can also work with user-provided memory or ScaLAPACK-style layouts.

Solving the System

The lu_solve function factors A and solves the system in one call:

slate::lu_solve(A, B);

Internally, this performs:

  1. LU factorization: \(PA = LU\) with partial pivoting

  2. Forward substitution: Solve \(LY = PB\)

  3. Back substitution: Solve \(UX = Y\)

The solution X overwrites B.

Execution Options

SLATE routines accept optional parameters to control execution:

slate::Options opts = {
    {slate::Option::Target, slate::Target::HostTask},
    {slate::Option::Lookahead, 2}
};
slate::lu_solve(A, B, opts);

Common options include:

  • Target: Execution target (HostTask, Devices, etc.)

  • Lookahead: Depth of lookahead for overlapping computation and communication

  • InnerBlocking: Inner blocking size for panel operations

Building the Example

Compile with the MPI C++ wrapper and link against SLATE libraries:

# Set paths
export SLATE_ROOT=/path/to/slate
export BLASPP_ROOT=${SLATE_ROOT}/blaspp
export LAPACKPP_ROOT=${SLATE_ROOT}/lapackpp

# Compile
mpicxx -fopenmp -c example.cc \
    -I${SLATE_ROOT}/include \
    -I${BLASPP_ROOT}/include \
    -I${LAPACKPP_ROOT}/include

# Link
mpicxx -fopenmp -o example example.o \
    -L${SLATE_ROOT}/lib -Wl,-rpath,${SLATE_ROOT}/lib \
    -L${BLASPP_ROOT}/lib -Wl,-rpath,${BLASPP_ROOT}/lib \
    -L${LAPACKPP_ROOT}/lib -Wl,-rpath,${LAPACKPP_ROOT}/lib \
    -lslate -llapackpp -lblaspp

For CUDA support, add:

-L${CUDA_HOME}/lib64 -Wl,-rpath,${CUDA_HOME}/lib64 \
-lcusolver -lcublas -lcudart

Running the Example

Run with MPI:

export OMP_NUM_THREADS=8
mpirun -n 4 ./example

Expected output (residual and time will vary by machine):

lu_solve n 5000, nb 256, p-by-q 2-by-2, residual 8.41e-20, tol 2.22e-16,
time 7.65e-01 sec, pass

Simplifying Assumptions

The example uses several defaults that may need tuning for optimal performance:

Tile Size (nb=256)

Should be tuned for the target architecture. Larger tiles (512-1024) often work better for GPUs.

Process Grid (p=2, q=2)

Square or near-square grids typically provide best performance. Avoid 1D grids (p=1 or q=1).

Data Distribution

Default is 2D block-cyclic. Custom distributions can be specified using lambda functions.

Execution Target

Default is HostTask (CPU with OpenMP tasks). Set to Devices for GPU acceleration.

Next Steps