Handling of Precisions
SLATE handles multiple precisions with C++ templating, so there is only one precision-independent version of the code, which is then instantiated for the desired precisions. Operations are defined to apply consistently across all precisions. For instance, blas::conj extends std::conj to real precisions (float, double), where it is a no-op. In contrast, std::conj applied to a real number returns a std::complex with a zero imaginary part, which is generally not what is desired in linear algebra algorithms. For instance, the assignment alpha = blas::conj(alpha); compiles for a real alpha, whereas alpha = std::conj(alpha); would not, since the resulting std::complex cannot be assigned back to a real variable.
SLATE’s BLAS++ component provides overloaded, precision-independent wrappers for all the underlying node-level BLAS, on top of which SLATE’s PBLAS are built. For instance, blas::gemm in BLAS++ maps to the classical sgemm, dgemm, cgemm, or zgemm BLAS routine, depending on the precision of its arguments. For real arithmetic, symmetric and Hermitian matrices are interchangeable, so hemm maps to symm, herk to syrk, and her2k to syr2k. This mapping aids in templating higher-level routines such as Cholesky, which uses herk (mapped to syrk in real arithmetic) to update the trailing matrix.
Currently, the SLATE library has explicit instantiations for the four main data types: float, double, std::complex<float>, and std::complex<double>. The SLATE code should accommodate other data types, such as half, double-double, or quad precision, given appropriate underlying node-level BLAS. For instance, Intel oneMKL, NVIDIA cuBLAS, and AMD rocBLAS provide half-precision gemm operations.
SLATE also implements mixed-precision algorithms that factor a matrix in low precision, then use iterative refinement to attain a high-precision final result. These exploit the faster processing in low precision for the \(O(n^3)\) factorization work, while the refinement in the slower high precision costs only \(O(n^2)\) work. In SLATE, the low and high precisions are independently templated; currently we use the traditional single and double combination. However, recent interest in half precision has led to algorithms combining it with either single or double as the high precision. One could also go to higher precisions, using double-double or quad for the high precision.
Given the relevant underlying node-level BLAS operations in the desired precisions in BLAS++, the templated nature of SLATE makes instantiating different combinations of precisions straightforward.