Getting Started
===============

This chapter provides a quick introduction to SLATE through a complete
example program that solves a linear system :math:`AX = B` using LU
factorization.

Example Program: LU Solve
-------------------------

The following example demonstrates how to set up SLATE matrices and solve
a linear system using the distributed ``lu_solve`` implementation (also
known as ``gesv`` in traditional LAPACK naming).

.. code-block:: cpp

    #include <slate/slate.hh>

    #include <mpi.h>
    #include <cstdio>
    #include <cstdlib>
    #include <stdexcept>

    int main(int argc, char** argv)
    {
        // Initialize MPI with thread support
        int mpi_provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_provided);
        if (mpi_provided != MPI_THREAD_MULTIPLE) {
            throw std::runtime_error("MPI_THREAD_MULTIPLE required");
        }

        int mpi_rank, mpi_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

        // Problem parameters
        int64_t n    = 5000;   // Matrix dimension
        int64_t nrhs = 1;      // Number of right-hand sides
        int64_t nb   = 256;    // Tile size
        int p = 2, q = 2;      // Process grid dimensions

        // Verify we have enough MPI processes
        if (mpi_size < p * q) {
            if (mpi_rank == 0) {
                printf("Need at least %d MPI processes\n", p * q);
            }
            MPI_Finalize();
            return 1;
        }

        // Create SLATE matrices
        slate::Matrix<double> A(n, n, nb, p, q, MPI_COMM_WORLD);
        slate::Matrix<double> B(n, nrhs, nb, p, q, MPI_COMM_WORLD);

        // Allocate local tiles on each process
        A.insertLocalTiles();
        B.insertLocalTiles();

        // Initialize with random data (each rank different seed)
        srand(100 * mpi_rank);
        // ... initialize A and B tiles ...

        // Solve AX = B using LU factorization
        // Solution X overwrites B
        slate::lu_solve(A, B);

        MPI_Finalize();
        return 0;
    }

Understanding the Example
-------------------------

MPI Initialization
~~~~~~~~~~~~~~~~~~

SLATE requires MPI to be initialized with ``MPI_THREAD_MULTIPLE`` support,
since SLATE uses OpenMP threads internally that may make MPI calls:
.. code-block:: cpp

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_provided);

Creating Matrices
~~~~~~~~~~~~~~~~~

SLATE matrices are defined with:

- **Dimensions**: ``n × n`` for matrix A, ``n × nrhs`` for matrix B
- **Tile size**: ``nb × nb`` blocks (256 is a reasonable default)
- **Process grid**: ``p × q`` grid of MPI processes
- **Communicator**: MPI communicator for collective operations

.. code-block:: cpp

    slate::Matrix<double> A(n, n, nb, p, q, MPI_COMM_WORLD);

The matrices are distributed in a 2D block-cyclic pattern across the
process grid, similar to ScaLAPACK.

Allocating Tiles
~~~~~~~~~~~~~~~~

After creating the matrix structure, local tiles must be allocated:

.. code-block:: cpp

    A.insertLocalTiles();

This allocates memory for the tiles that belong to the current process
according to the 2D block-cyclic distribution. SLATE can also work with
user-provided memory or ScaLAPACK-style layouts.

Solving the System
~~~~~~~~~~~~~~~~~~

The ``lu_solve`` function factors A and solves the system in one call:

.. code-block:: cpp

    slate::lu_solve(A, B);

Internally, this performs:

1. LU factorization: :math:`PA = LU` with partial pivoting
2. Forward substitution: solve :math:`LY = PB`
3. Back substitution: solve :math:`UX = Y`

The solution X overwrites B.

Execution Options
~~~~~~~~~~~~~~~~~

SLATE routines accept optional parameters to control execution:

.. code-block:: cpp

    slate::Options opts = {
        {slate::Option::Target, slate::Target::HostTask},
        {slate::Option::Lookahead, 2}
    };
    slate::lu_solve(A, B, opts);

Common options include:

- ``Target``: execution target (``HostTask``, ``Devices``, etc.)
- ``Lookahead``: depth of lookahead for overlapping computation and
  communication
- ``InnerBlocking``: inner blocking size for panel operations

Building the Example
--------------------

Compile with the MPI C++ wrapper and link against the SLATE libraries:
.. code-block:: bash

    # Set paths
    export SLATE_ROOT=/path/to/slate
    export BLASPP_ROOT=${SLATE_ROOT}/blaspp
    export LAPACKPP_ROOT=${SLATE_ROOT}/lapackpp

    # Compile
    mpicxx -fopenmp -c example.cc \
        -I${SLATE_ROOT}/include \
        -I${BLASPP_ROOT}/include \
        -I${LAPACKPP_ROOT}/include

    # Link
    mpicxx -fopenmp -o example example.o \
        -L${SLATE_ROOT}/lib -Wl,-rpath,${SLATE_ROOT}/lib \
        -L${BLASPP_ROOT}/lib -Wl,-rpath,${BLASPP_ROOT}/lib \
        -L${LAPACKPP_ROOT}/lib -Wl,-rpath,${LAPACKPP_ROOT}/lib \
        -lslate -llapackpp -lblaspp

For CUDA support, add:

.. code-block:: bash

    -L${CUDA_HOME}/lib64 -Wl,-rpath,${CUDA_HOME}/lib64 \
    -lcusolver -lcublas -lcudart

Running the Example
-------------------

Run with MPI:

.. code-block:: bash

    export OMP_NUM_THREADS=8
    mpirun -n 4 ./example

Expected output:

.. code-block:: text

    lu_solve n 5000, nb 256, p-by-q 2-by-2, residual 8.41e-20, tol 2.22e-16, time 7.65e-01 sec, pass

Simplifying Assumptions
-----------------------

The example uses several defaults that may need tuning for optimal
performance:

**Tile Size (nb = 256)**
    Should be tuned for the target architecture. Larger tiles (512-1024)
    often work better for GPUs.

**Process Grid (p = 2, q = 2)**
    Square or near-square grids typically provide the best performance.
    Avoid 1D grids (p = 1 or q = 1).

**Data Distribution**
    The default is 2D block-cyclic. Custom distributions can be specified
    using lambda functions.

**Execution Target**
    The default is ``HostTask`` (CPU with OpenMP tasks). Set to
    ``Devices`` for GPU acceleration.

Next Steps
----------

- :doc:`installation`: Detailed installation instructions
- :doc:`matrices`: Understanding SLATE matrix types and operations
- :doc:`operations`: Guide to all SLATE operations
- :doc:`../api/slate/index`: Complete API reference
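As a closing illustration of the 2D block-cyclic distribution mentioned
above, the following standalone sketch (plain C++, no SLATE or MPI
required) shows how tile indices can map to process ranks in a ``p × q``
grid. The row-major rank numbering here is an assumption made for
illustration; SLATE's actual mapping depends on the grid order and is
customizable via lambda functions.

.. code-block:: cpp

    #include <cstdio>
    #include <cstdint>

    // Illustrative 2D block-cyclic mapping: tile (i, j) is owned by
    // process (i mod p, j mod q) in a p-by-q grid. Row-major rank
    // numbering is assumed here for simplicity.
    int owner_rank(int64_t i, int64_t j, int p, int q)
    {
        return static_cast<int>((i % p) * q + (j % q));
    }

    int main()
    {
        int p = 2, q = 2;   // process grid, as in the example
        int64_t nt = 4;     // 4x4 grid of tiles for illustration

        // Print which rank owns each tile; rows and columns of tiles
        // cycle over the process grid.
        for (int64_t i = 0; i < nt; ++i) {
            for (int64_t j = 0; j < nt; ++j)
                printf("%d ", owner_rank(i, j, p, q));
            printf("\n");
        }
        return 0;
    }

For a 2-by-2 grid, ranks recur every two tile rows and columns, so each
process owns an interleaved quarter of the tiles, which keeps the work
balanced as the factorization shrinks the active matrix.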