Handling of Precisions
======================

SLATE handles multiple precisions with C++ templating, so there is only one precision-independent version of the code, which is then instantiated for the desired precisions. Operations are defined to apply consistently across all precisions. For instance, ``blas::conj`` extends ``std::conj`` to apply to real precisions (``float``, ``double``), where it is a no-op, whereas ``std::conj`` applied to a real number returns ``std::complex`` with a zero imaginary part, which is generally not what is desired in linear algebra algorithms. Thus ``alpha = blas::conj(alpha);`` works for real scalars, where ``std::conj`` would not.

SLATE's BLAS++ component provides overloaded, precision-independent wrappers for all the underlying node-level BLAS, on top of which SLATE's PBLAS are built. For instance, ``blas::gemm`` in BLAS++ maps to the classical ``sgemm``, ``dgemm``, ``cgemm``, or ``zgemm`` BLAS routine, depending on the precision of its arguments. In real arithmetic, symmetric and Hermitian matrices are interchangeable, so ``hemm`` maps to ``symm``, ``herk`` to ``syrk``, and ``her2k`` to ``syr2k``. This mapping aids in templating higher-level routines, such as Cholesky, which performs a ``herk`` (mapped to ``syrk`` in real arithmetic) to update the trailing matrix.

Currently, the SLATE library has explicit instantiations for the four main data types: ``float``, ``double``, ``std::complex<float>``, and ``std::complex<double>``. The SLATE code can accommodate other data types, such as half, double-double, or quad precision, given appropriate underlying node-level BLAS. For instance, Intel oneMKL, NVIDIA cuBLAS, and AMD rocBLAS provide half-precision ``gemm`` operations.

SLATE also implements mixed-precision algorithms that factor a matrix in low precision, then use iterative refinement to attain a high-precision final result.
These exploit the faster processing in low precision for the :math:`O(n^3)` factorization work, while the refinement done in the slower high precision is only :math:`O(n^2)` work. In SLATE, the low and high precisions are independently templated; currently SLATE uses the traditional single/double combination. However, recent interest in half precision has led to algorithms using it together with either single or double. One could also go to higher precisions, using double-double or quad for the high precision. Once the relevant underlying node-level BLAS operations in the desired precisions are added to BLAS++, SLATE's templated nature greatly simplifies instantiating different combinations of precisions.
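As a minimal illustration of this factor-low/refine-high pattern, the sketch below solves :math:`Ax = b` with the :math:`O(n^3)` elimination done in a low precision and the :math:`O(n^2)` residual computed in a high precision. It is a toy sketch, not SLATE's implementation: it uses naive Gaussian elimination without pivoting, and it re-solves rather than reusing the low-precision factorization; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy dense solve of A x = b (row-major n x n) by Gaussian elimination
// without pivoting, templated on the scalar type. Demo only.
template <typename scalar_t>
std::vector<scalar_t> ge_solve(
    std::vector<scalar_t> A, std::vector<scalar_t> b, int n)
{
    for (int k = 0; k < n; ++k) {
        for (int i = k+1; i < n; ++i) {
            scalar_t l = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; ++j)
                A[i*n + j] -= l * A[k*n + j];
            b[i] -= l * b[k];
        }
    }
    std::vector<scalar_t> x(n);
    for (int i = n-1; i >= 0; --i) {
        scalar_t s = b[i];
        for (int j = i+1; j < n; ++j)
            s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
    return x;
}

// Mixed-precision solve, independently templated on the high and low
// precisions: O(n^3) elimination in scalar_lo, O(n^2) refinement in
// scalar_hi. A real implementation would factor once and reuse the factors.
template <typename scalar_hi, typename scalar_lo>
std::vector<scalar_hi> refine_solve(
    const std::vector<scalar_hi>& A, const std::vector<scalar_hi>& b, int n)
{
    // Round the problem down to low precision and solve there.
    std::vector<scalar_lo> A_lo(A.begin(), A.end());
    std::vector<scalar_lo> b_lo(b.begin(), b.end());
    std::vector<scalar_lo> x_lo = ge_solve(A_lo, b_lo, n);
    std::vector<scalar_hi> x(x_lo.begin(), x_lo.end());

    for (int iter = 0; iter < 5; ++iter) {
        // High-precision residual r = b - A x: O(n^2) work.
        std::vector<scalar_hi> r(n);
        for (int i = 0; i < n; ++i) {
            scalar_hi s = b[i];
            for (int j = 0; j < n; ++j)
                s -= A[i*n + j] * x[j];
            r[i] = s;
        }
        // Correction solved in low precision, applied in high precision.
        std::vector<scalar_lo> r_lo(r.begin(), r.end());
        std::vector<scalar_lo> d_lo = ge_solve(A_lo, r_lo, n);
        for (int i = 0; i < n; ++i)
            x[i] += static_cast<scalar_hi>(d_lo[i]);
    }
    return x;
}
```

Instantiating ``refine_solve<double, float>`` gives the traditional single/double combination; because both precisions are template parameters, other pairings need only the corresponding low-precision solve.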
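The precision-independent ``blas::conj`` behavior described earlier can be sketched with ordinary C++ overloading: complex arguments defer to ``std::conj``, while real arguments pass through unchanged and keep their real type. The ``demo`` namespace and function names below are illustrative, not SLATE's or BLAS++'s source.

```cpp
#include <cassert>
#include <complex>

namespace demo {

// Complex case: defer to std::conj.
template <typename T>
std::complex<T> conj(std::complex<T> x) { return std::conj(x); }

// Real cases: a no-op that preserves the real type, unlike std::conj,
// which would return std::complex with a zero imaginary part.
inline float  conj(float  x) { return x; }
inline double conj(double x) { return x; }

// A precision-independent routine can now be written once:
// alpha = conj(alpha); compiles for real and complex scalar_t alike.
template <typename scalar_t>
scalar_t dot_conj(scalar_t x, scalar_t y) { return conj(x) * y; }

}  // namespace demo
```

With these overloads, one templated body serves all four precisions; overload resolution, not runtime branching, selects the right behavior.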