Testing and Tuning
SLATE includes a comprehensive testing suite for verifying correctness and measuring performance. This chapter covers the tester, performance tuning, and unit tests.
SLATE Tester
The SLATE tester is built in the test/ directory and exercises all library functionality.
Basic Usage
cd test
# List available tests
./tester --help
# Quick test of gemm with small defaults
./tester gemm
# List options for a specific routine
./tester --help gemm
Single-Process Testing
# Sweep over matrix dimensions
./tester --nb 256 --dim 1000:5000:1000 gemm
# Multiple data types
./tester --type s,d,c,z gemm
# Different execution targets
./tester --target t gemm # HostTask (CPU)
./tester --target d gemm # Devices (GPU)
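The --dim 1000:5000:1000 sweep above uses start:stop:step range syntax. A small dry-run sketch of the assumed semantics (inclusive endpoints, fixed stride; the tester parses this internally, the loop here only illustrates it):

```shell
# Expand a start:stop:step range the way "--dim 1000:5000:1000" does
# (assumed semantics: inclusive endpoints, fixed stride).
range="1000:5000:1000"
set -- $(printf '%s' "$range" | tr ':' ' ')
start=$1; stop=$2; step=$3
for dim in $(seq "$start" "$step" "$stop"); do
    echo "would run: ./tester --nb 256 --dim ${dim} gemm"
done
```

The same syntax applies to other numeric parameters such as --nb.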
Multi-Process Testing
# Using mpirun
mpirun -n 4 ./tester --nb 256 --dim 1000:5000:1000 --grid 2x2 gemm
# Using Slurm
srun --nodes=4 --ntasks=16 --cpus-per-task=8 ./tester --grid 4x4 gemm
Tester Parameters
Common parameters for the tester:
--check check results (default: y)
--ref run ScaLAPACK reference (default: n)
--tol error tolerance, as a multiple of machine epsilon (default: 50)
--repeat repetitions per test (default: 1)
--verbose output verbosity level (0-4)
--type data type: s, d, c, z (default: d)
--origin data origin: h=Host, s=ScaLAPACK, d=Devices
--target execution target: t=HostTask, n=HostNest,
b=HostBatch, d=Devices (default: t)
--transA/--transB transpose: n, t, c (default: n)
--uplo upper/lower: u, l (default: l)
--diag diagonal: u=unit, n=non-unit (default: n)
--dim m x n x k dimensions
--nb tile size (default: 384)
--grid p x q MPI grid
--lookahead lookahead panels (default: 1)
Example Tester Output
% SLATE version 2023.08.25
% input: ./tester gemm
% MPI: 4 ranks, CPU-only MPI, 8 OpenMP threads per MPI rank
gemm dtype=d, origin=h, target=t, transA=n, transB=n, m=100, n=100, k=100,
alpha=1+0i, beta=1+0i, nb=100, grid=2x2, la=1
time=0.00123, gflop/s=1.63, ref_time=0.00089, ref_gflop/s=2.24,
error=1.2e-15, okay
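When scripting sweeps, the data line can be scraped with standard tools. A sketch, assuming output in the comma-separated key=value style shown above (field names taken from the sample):

```shell
# Scrape the gflop/s field from a tester data line (format assumed to
# match the sample output above).
line="time=0.00123, gflop/s=1.63, ref_time=0.00089, ref_gflop/s=2.24,"
gflops=$(printf '%s\n' "$line" | tr ',' '\n' | grep '^ *gflop/s=' | cut -d= -f2)
echo "measured: ${gflops} gflop/s"
```

The anchored pattern '^ *gflop/s=' avoids also matching the ref_gflop/s field.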
Accuracy Verification
SLATE uses backward error analysis for verification. Most routines check accuracy in two ways:
Without Reference (--ref=n)
A fast check based on algebraic identities: for gemm, the computed result is verified against the defining identity \(C = \alpha AB + \beta C\), without forming a reference solution.
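The residual test typically has the following shape. This is a standard backward-error form shown as a sketch; the exact norm and scaling SLATE uses may differ:

```latex
\text{error} =
  \frac{\lVert C_{\text{out}} - (\alpha A B + \beta C_{\text{in}}) \rVert}
       {\left( \lvert\alpha\rvert \, \lVert A\rVert \, \lVert B\rVert
         + \lvert\beta\rvert \, \lVert C_{\text{in}}\rVert \right) k \, \varepsilon}
```

With the machine epsilon \(\varepsilon\) in the denominator, the test passes when the error falls below the --tol threshold.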
With Reference (--ref=y)
Compare against ScaLAPACK reference implementation. Slower but more robust.
./tester --ref=y gemm
Norm Verification
Matrix norm tests compare against ScaLAPACK. Note that older ScaLAPACK versions have accuracy issues in norm computation.
Full Testing Suite
The run_tests.py script runs comprehensive tests:
cd test
# View options
python3 ./run_tests.py --help
# Default full test suite
python3 ./run_tests.py --xml ../report.xml
# Small quick tests
python3 ./run_tests.py --xsmall
# Specific routines
python3 ./run_tests.py --xsmall gesv potrf
Custom Test Commands
# Using Slurm
python3 ./run_tests.py \
--test "salloc -N 4 -n 4 -t 10 mpirun -n 4 ./tester" \
--xsmall gesv
# GPU execution
python3 ./run_tests.py \
--test "mpirun -n 4 ./tester" \
--xsmall --target d gesv
Performance Tuning
Tile Size
Tile size (nb) significantly affects performance:
# Sweep tile sizes for CPU
./tester --type d --target t --dim 3000 --nb 128:512:32 gesv
# Sweep tile sizes for GPU (typically larger, multiple of 64)
./tester --type d --target d --dim 10000 --nb 192:1024:64 gesv
General guidelines:
CPU: 128-512, depending on cache sizes
NVIDIA GPU: 384-1024, multiples of 64
AMD GPU: 256-768, multiples of 64
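A sweep over candidate tile sizes can be scripted around the tester. A dry-run sketch following the GPU guideline above (multiples of 64); swap echo for real execution to run it:

```shell
# Dry-run tile-size sweep over multiples of 64, per the GPU guideline
# above; prints the commands instead of executing them.
for nb in $(seq 192 64 1024); do
    echo "./tester --type d --target d --dim 10000 --nb ${nb} gesv"
done
```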
Process Grid
Near-square grids usually provide the best performance:
# Test different grids
mpirun -n 4 ./tester --nb 256 --dim 10000 --grid 1x4,2x2 gemm
# 1D grids (1xq or px1) are typically slower
Avoid 1D grids as they lead to higher communication overhead.
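To pick a near-square grid for a given rank count, the factor pairs can be enumerated. A small sketch with no SLATE dependency:

```shell
# Enumerate p x q factorizations of an MPI rank count; the last line
# printed is the squarest grid (here 4x4 for 16 ranks).
ranks=16
p=1
while [ $((p * p)) -le "$ranks" ]; do
    if [ $((ranks % p)) -eq 0 ]; then
        echo "${p}x$((ranks / p))"
    fi
    p=$((p + 1))
done
```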
Lookahead
Lookahead depth for overlapping communication and computation:
./tester --dim 10000 --lookahead 1,2,4 gesv
Default of 1 is usually sufficient. Higher values require more memory.
Panel Threads
Control threads used in panel operations:
export OMP_NUM_THREADS=32
# Panel uses min(OMP_NUM_THREADS/2, MaxPanelThreads)
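Under the rule above, the effective panel-thread count can be computed up front. A sketch; the max_panel_threads value here is an illustrative cap standing in for MaxPanelThreads:

```shell
# Effective panel threads under the rule above:
# min(OMP_NUM_THREADS / 2, MaxPanelThreads).
OMP_NUM_THREADS=32
max_panel_threads=8   # illustrative cap
half=$((OMP_NUM_THREADS / 2))
if [ "$half" -lt "$max_panel_threads" ]; then
    panel_threads=$half
else
    panel_threads=$max_panel_threads
fi
echo "panel threads: ${panel_threads}"
```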
Multi-threaded MPI Broadcast
On some systems, multi-threaded MPI broadcast improves performance. Enable it by building with:
# In make.inc
CXXFLAGS += -DSLATE_HAVE_MT_BCAST
Warning
On some systems (e.g., Frontier with GPU-aware MPI), this can cause hangs. Test carefully.
GPU-Aware MPI
Enable for direct GPU-to-GPU transfers:
export SLATE_GPU_AWARE_MPI=1
# For Cray MPI
export MPICH_GPU_SUPPORT_ENABLED=1
Performance Examples
# Tune tile size for double precision gemm on GPU
mpirun -n 4 ./tester --type d --target d --dim 10000 \
--nb 256,384,512,640,768 gemm
# Compare process grids for Cholesky
mpirun -n 16 ./tester --type d --target d --dim 20000 --nb 512 \
--grid 1x16,2x8,4x4 potrf
# Test lookahead for LU
mpirun -n 16 ./tester --target d --dim 20000 --nb 512 \
--grid 4x4 --lookahead 1,2,4 getrf
Unit Tests
Unit tests verify individual SLATE components (matrix classes, memory manager, tiles):
cd unit_test
# Run default unit tests
python3 ./run_tests.py --xml ../report_unit.xml
# Specific test
./unit_test Matrix
Benchmark Suite
For production benchmarking:
# Disable result checking for timing
./tester --check=n --repeat=3 gemm
# Compare with ScaLAPACK
./tester --ref=y --repeat=3 gemm
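With --repeat, each repetition prints its own data line, and the mean throughput can be computed from the scraped values. A sketch with awk, using made-up numbers in place of real tester output:

```shell
# Average gflop/s over repeated runs (values are illustrative).
printf '%s\n' 410.2 415.7 408.1 |
awk '{ sum += $1; n += 1 } END { printf "mean gflop/s: %.1f\n", sum / n }'
```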
Troubleshooting Tests
Common Issues
Test failures with old ScaLAPACK
Older ScaLAPACK has accuracy bugs in norms. Update or use --ref=n.
GPU tests hang
Try disabling GPU-aware MPI:
unset SLATE_GPU_AWARE_MPI
Memory errors
Reduce the problem size or the number of OpenMP threads:
export OMP_NUM_THREADS=4
./tester --dim 1000 gesv
Debugging
# Verbose output
./tester --verbose=2 gemm
# Print matrices (small problems only!)
./tester --verbose=4 --dim 10 gemm
# Debug with specific rank
./tester --debug=0 gemm # rank 0 waits for debugger