Testing and Tuning
==================

SLATE includes a comprehensive testing suite for verifying correctness and
measuring performance. This chapter covers the tester, performance tuning,
and unit tests.

SLATE Tester
------------

The SLATE tester is built in the ``test/`` directory and exercises all
library functionality.

Basic Usage
~~~~~~~~~~~

.. code-block:: bash

   cd test

   # List available tests
   ./tester --help

   # Quick test of gemm with small defaults
   ./tester gemm

   # List options for a specific routine
   ./tester --help gemm

Single-Process Testing
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Sweep over matrix dimensions
   ./tester --nb 256 --dim 1000:5000:1000 gemm

   # Multiple data types
   ./tester --type s,d,c,z gemm

   # Different execution targets
   ./tester --target t gemm   # HostTask (CPU)
   ./tester --target d gemm   # Devices (GPU)

Multi-Process Testing
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Using mpirun
   mpirun -n 4 ./tester --nb 256 --dim 1000:5000:1000 --grid 2x2 gemm

   # Using Slurm
   srun --nodes=4 --ntasks=16 --cpus-per-task=8 ./tester --grid 4x4 gemm

Tester Parameters
~~~~~~~~~~~~~~~~~

Common parameters for the tester:

.. code-block:: text

   --check       check results (default: y)
   --ref         run ScaLAPACK reference (default: n)
   --tol         error tolerance (default: 50)
   --repeat      repetitions per test (default: 1)
   --verbose     output verbosity level (0-4)
   --type        data type: s, d, c, z (default: d)
   --origin      data origin: h=Host, s=ScaLAPACK, d=Devices
   --target      execution target: t=HostTask, n=HostNest,
                 b=HostBatch, d=Devices (default: t)
   --transA/--transB  transpose: n, t, c (default: n)
   --uplo        upper/lower: u, l (default: l)
   --diag        diagonal: u=unit, n=non-unit (default: n)
   --dim         m x n x k dimensions
   --nb          tile size (default: 384)
   --grid        p x q MPI grid
   --lookahead   lookahead panels (default: 1)

Example Tester Output
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   % SLATE version 2023.08.25
   % input: ./tester gemm
   % MPI: 4 ranks, CPU-only MPI, 8 OpenMP threads per MPI rank

   gemm dtype=d, origin=h, target=t, transA=n, transB=n,
   m=100, n=100, k=100, alpha=1+0i, beta=1+0i, nb=100, grid=2x2, la=1,
   time=0.00123, gflop/s=1.63, ref_time=0.00089, ref_gflop/s=2.24,
   error=1.2e-15, okay

Accuracy Verification
---------------------

SLATE uses backward error analysis for verification. Most routines check
accuracy in two ways:

Without Reference (--ref=n)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast check using algebraic identities. For gemm
:math:`C = \alpha AB + \beta C`:

.. math::

   Y_1 = \alpha A (B X) + \beta (C_{in} X)

   Y_2 = C_{out} X

   \text{error} = \frac{\|Y_1 - Y_2\|}{\|Y_1\|}

With Reference (--ref=y)
~~~~~~~~~~~~~~~~~~~~~~~~

Compare against the ScaLAPACK reference implementation. Slower but more
robust.

.. code-block:: bash

   ./tester --ref=y gemm

Norm Verification
~~~~~~~~~~~~~~~~~

Matrix norms are compared against ScaLAPACK. Note that older ScaLAPACK
versions have accuracy issues in norm computation.

Full Testing Suite
------------------

The ``run_tests.py`` script runs comprehensive tests:

.. code-block:: bash

   cd test

   # View options
   python3 ./run_tests.py --help

   # Default full test suite
   python3 ./run_tests.py --xml ../report.xml

   # Small quick tests
   python3 ./run_tests.py --xsmall

   # Specific routines
   python3 ./run_tests.py --xsmall gesv potrf

Custom Test Commands
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Using Slurm
   python3 ./run_tests.py \
       --test "salloc -N 4 -n 4 -t 10 mpirun -n 4 ./tester" \
       --xsmall gesv

   # GPU execution
   python3 ./run_tests.py \
       --test "mpirun -n 4 ./tester" \
       --xsmall --target d gesv

Performance Tuning
------------------

Tile Size
~~~~~~~~~

Tile size (``nb``) significantly affects performance:

.. code-block:: bash

   # Sweep tile sizes for CPU
   ./tester --type d --target t --dim 3000 --nb 128:512:32 gesv

   # Sweep tile sizes for GPU (typically larger, multiple of 64)
   ./tester --type d --target d --dim 10000 --nb 192:1024:64 gesv

General guidelines:

- **CPU**: 128-512, depending on cache sizes
- **NVIDIA GPU**: 384-1024, multiples of 64
- **AMD GPU**: 256-768, multiples of 64

Process Grid
~~~~~~~~~~~~

Near-square grids usually provide the best performance:

.. code-block:: bash

   # Test different grids
   mpirun -n 4 ./tester --nb 256 --dim 10000 --grid 1x4,2x2 gemm

   # 1D grids (1xq or px1) are typically slower

Avoid 1D grids, as they lead to higher communication overhead.

Lookahead
~~~~~~~~~

Lookahead depth for overlapping communication and computation:

.. code-block:: bash

   ./tester --dim 10000 --lookahead 1,2,4 gesv

The default of 1 is usually sufficient. Higher values require more memory.

Panel Threads
~~~~~~~~~~~~~

Control the threads used in panel operations:

.. code-block:: bash

   export OMP_NUM_THREADS=32
   # Panel uses min(OMP_NUM_THREADS/2, MaxPanelThreads)

Multi-threaded MPI Broadcast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On some systems, multi-threaded MPI broadcast improves performance.
Enable it by building with:

.. code-block:: make

   # In make.inc
   CXXFLAGS += -DSLATE_HAVE_MT_BCAST

.. warning::

   On some systems (e.g., Frontier with GPU-aware MPI), this can cause
   hangs. Test carefully.

GPU-Aware MPI
~~~~~~~~~~~~~

Enable for direct GPU-to-GPU transfers:

.. code-block:: bash

   export SLATE_GPU_AWARE_MPI=1

   # For Cray MPI
   export MPICH_GPU_SUPPORT_ENABLED=1

Performance Examples
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Tune tile size for double precision gemm on GPU
   mpirun -n 4 ./tester --type d --target d --dim 10000 \
       --nb 256,384,512,640,768 gemm

   # Compare process grids for Cholesky
   mpirun -n 16 ./tester --type d --target d --dim 20000 --nb 512 \
       --grid 1x16,2x8,4x4 potrf

   # Test lookahead for LU
   mpirun -n 16 ./tester --target d --dim 20000 --nb 512 \
       --grid 4x4 --lookahead 1,2,4 getrf

Unit Tests
----------

Unit tests verify individual SLATE components (matrix classes, memory
manager, tiles):

.. code-block:: bash

   cd unit_test

   # Run default unit tests
   python3 ./run_tests.py --xml ../report_unit.xml

   # Specific test
   ./unit_test Matrix

Benchmark Suite
---------------

For production benchmarking:

.. code-block:: bash

   # Disable result checking for timing
   ./tester --check=n --repeat=3 gemm

   # Compare with ScaLAPACK
   ./tester --ref=y --repeat=3 gemm

Troubleshooting Tests
---------------------

Common Issues
~~~~~~~~~~~~~

**Test failures with old ScaLAPACK**

Older ScaLAPACK has accuracy bugs in norms. Update or use ``--ref=n``.

**GPU tests hang**

Try disabling GPU-aware MPI:

.. code-block:: bash

   unset SLATE_GPU_AWARE_MPI

**Memory errors**

Reduce the problem size or the number of OpenMP threads:

.. code-block:: bash

   export OMP_NUM_THREADS=4
   ./tester --dim 1000 gesv

Debugging
~~~~~~~~~

.. code-block:: bash

   # Verbose output
   ./tester --verbose=2 gemm

   # Print matrices (small problems only!)
   ./tester --verbose=4 --dim 10 gemm

   # Debug with specific rank
   ./tester --debug=0 gemm   # rank 0 waits for debugger
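Parsing Tester Output
~~~~~~~~~~~~~~~~~~~~~

When automating tuning sweeps or post-processing test logs, the tester's
one-line results can be collected programmatically. The following Python
sketch (a hypothetical helper, not part of SLATE) extracts the
``key=value`` fields in the format shown in the example output earlier;
the exact field names and layout may differ between SLATE versions.

.. code-block:: python

   import re

   def parse_tester_line(line):
       """Hypothetical helper: extract key=value fields from one tester
       result line into a dict, converting numeric values to float."""
       fields = {}
       for key, value in re.findall(r"([\w/]+)=([^,\s]+)", line):
           try:
               fields[key] = float(value)
           except ValueError:
               fields[key] = value  # keep non-numeric values (e.g. '2x2')
       return fields

   # Result line in the format shown under "Example Tester Output" above.
   line = ("time=0.00123, gflop/s=1.63, ref_time=0.00089, "
           "ref_gflop/s=2.24, error=1.2e-15, okay")
   result = parse_tester_line(line)
   print(result["gflop/s"])   # -> 1.63

Applied to every result line of a ``--nb`` or ``--grid`` sweep, this makes
it straightforward to sort configurations by ``gflop/s`` and pick the best
one, rather than reading the columns by eye.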