Device (GPU) Operations

GPU device management and asynchronous LAPACK operations using cuSOLVER, rocSOLVER, or SYCL.

Queue Management

class Queue : public blas::Queue

GPU device queue for asynchronous LAPACK operations.

Extends blas::Queue with cuSOLVER/rocSOLVER handle management. Provides asynchronous execution of LAPACK routines on GPU devices, including factorizations (potrf, getrf), triangular solves (trsm), and eigenvalue/SVD computations.

Public Functions

inline Queue()
inline Queue(int device)
inline ~Queue()
Queue(Queue const&) = delete
Queue &operator=(Queue const&) = delete
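A minimal usage sketch (assuming LAPACK++ was built with GPU support and at least one device is present): because the copy constructor and copy assignment are deleted, a Queue is created once per device and passed by reference to each routine.

```cpp
#include <lapack.hh>

int main() {
    // Create a queue on device 0; the cuSOLVER/rocSOLVER handle is
    // created internally and bound to this device.
    lapack::Queue queue( 0 );

    // ... enqueue asynchronous LAPACK operations here ...

    queue.sync();  // block until all queued work has completed
    return 0;
}
```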

GPU Operations

Cholesky Factorization

template<typename scalar_t>
void lapack::potrf(lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, device_info_int *dev_info, lapack::Queue &queue)

Cholesky factorization of Hermitian positive definite matrix on GPU.

Computes the Cholesky factorization of an n-by-n Hermitian positive definite matrix A on device memory:

  • If uplo = Upper: \( A = U^H U \)

  • If uplo = Lower: \( A = L L^H \)

Template Parameters:

scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>

Parameters:
  • uplo[in] Whether upper or lower triangle of A is stored

  • n[in] Matrix dimension. n >= 0

  • dA[inout] Device pointer to n-by-n matrix A with leading dimension ldda. On exit, the factor U or L is stored in the respective triangle

  • ldda[in] Leading dimension of dA. ldda >= max(1,n)

  • dev_info[out] Device pointer to info status:

    • 0: successful

    • i > 0: the leading minor of order i is not positive definite, and the factorization could not be completed

  • queue[in] GPU device queue for asynchronous execution
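The sketch below factors a small positive definite matrix on the device. It is illustrative only: the allocation and copy helpers (blas::device_malloc, blas::device_copy_matrix, blas::device_memcpy, blas::device_free) are assumed from a recent BLAS++; check your installed version for the exact signatures.

```cpp
#include <blas.hh>
#include <lapack.hh>
#include <cstdint>
#include <vector>

int main() {
    int64_t n = 3, ldda = n;
    // Host matrix: symmetric positive definite (diagonally dominant),
    // stored column-major.
    std::vector<double> A = { 4, 1, 1,
                              1, 4, 1,
                              1, 1, 4 };

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( n*ldda, queue );
    device_info_int* dev_info = blas::device_malloc<device_info_int>( 1, queue );

    // Copy A to the device, factor asynchronously, fetch the status back.
    blas::device_copy_matrix( n, n, A.data(), n, dA, ldda, queue );
    lapack::potrf( lapack::Uplo::Lower, n, dA, ldda, dev_info, queue );

    device_info_int info;
    blas::device_memcpy( &info, dev_info, 1, queue );
    queue.sync();  // wait for factorization and copies; info == 0 on success

    blas::device_free( dA, queue );
    blas::device_free( dev_info, queue );
    return 0;
}
```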

LU Factorization

template<typename scalar_t>
void lapack::getrf_work_size_bytes(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue)

Query workspace sizes for LU factorization on GPU.

Parameters:
  • m[in] Number of rows. m >= 0

  • n[in] Number of columns. n >= 0

  • dA[in] Device pointer to m-by-n matrix (unused, for overloading)

  • ldda[in] Leading dimension. ldda >= max(1,m)

  • dev_work_size[out] Size in bytes for device workspace

  • host_work_size[out] Size in bytes for host workspace

  • queue[in] GPU device queue

template<typename scalar_t>
void lapack::getrf(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, device_pivot_int *dev_ipiv, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue)

LU factorization with partial pivoting on GPU.

Computes LU factorization with row interchanges of an m-by-n matrix A: \( A = P L U \) where P is a permutation matrix, L is lower triangular with unit diagonal, and U is upper triangular.

Template Parameters:

scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>

Parameters:
  • m[in] Number of rows. m >= 0

  • n[in] Number of columns. n >= 0

  • dA[inout] Device pointer to m-by-n matrix A with leading dimension ldda. On exit, factors L and U stored in lower/upper triangles

  • ldda[in] Leading dimension. ldda >= max(1,m)

  • dev_ipiv[out] Device pointer to pivot indices array of length min(m,n)

  • dev_work[in] Device workspace pointer

  • dev_work_size[in] Size of device workspace in bytes

  • host_work[in] Host workspace pointer

  • host_work_size[in] Size of host workspace in bytes

  • dev_info[out] Device pointer to info status:

    • 0: successful

    • i > 0: U(i,i) is exactly zero; the factorization has been completed, but U is exactly singular, so using it to solve a system would divide by zero

  • queue[in] GPU device queue for asynchronous execution
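LU factorization is a two-phase call: query workspace sizes with getrf_work_size_bytes, then allocate and factor. A sketch under the same assumptions as above (BLAS++ device helpers; names like dev_work are local choices, not part of the API):

```cpp
#include <blas.hh>
#include <lapack.hh>
#include <algorithm>
#include <cstdint>
#include <vector>

int main() {
    int64_t m = 4, n = 4, ldda = m;
    std::vector<double> A( m*n, 1.0 );
    for (int64_t i = 0; i < std::min( m, n ); ++i)
        A[ i + i*m ] += double( m );  // make A diagonally dominant

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda*n, queue );
    blas::device_copy_matrix( m, n, A.data(), m, dA, ldda, queue );

    // Phase 1: query workspace sizes.
    size_t dev_size, host_size;
    lapack::getrf_work_size_bytes( m, n, dA, ldda, &dev_size, &host_size, queue );

    // Phase 2: allocate workspaces, pivot array, and status, then factor.
    char* dev_work = blas::device_malloc<char>( dev_size, queue );
    std::vector<char> host_work( host_size );
    device_pivot_int* dev_ipiv
        = blas::device_malloc<device_pivot_int>( std::min( m, n ), queue );
    device_info_int* dev_info = blas::device_malloc<device_info_int>( 1, queue );

    lapack::getrf( m, n, dA, ldda, dev_ipiv, dev_work, dev_size,
                   host_work.data(), host_size, dev_info, queue );
    queue.sync();  // dA now holds the L and U factors

    blas::device_free( dA, queue );
    blas::device_free( dev_work, queue );
    blas::device_free( dev_ipiv, queue );
    blas::device_free( dev_info, queue );
    return 0;
}
```

Querying sizes on every call is cheap; for repeated factorizations of same-shaped matrices, the workspaces can be allocated once and reused.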

QR Factorization

template<typename scalar_t>
void lapack::geqrf_work_size_bytes(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue)

Query workspace sizes for QR factorization on GPU.

Parameters:
  • m[in] Number of rows. m >= 0

  • n[in] Number of columns. n >= 0

  • dA[in] Device pointer to m-by-n matrix (unused, for overloading)

  • ldda[in] Leading dimension. ldda >= max(1,m)

  • dev_work_size[out] Size in bytes for device workspace

  • host_work_size[out] Size in bytes for host workspace

  • queue[in] GPU device queue

template<typename scalar_t>
void lapack::geqrf(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, scalar_t *dtau, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue)

QR factorization on GPU.

Computes QR factorization of an m-by-n matrix A: \( A = Q R \) where Q is orthogonal/unitary and R is upper triangular.

Template Parameters:

scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>

Parameters:
  • m[in] Number of rows. m >= 0

  • n[in] Number of columns. n >= 0

  • dA[inout] Device pointer to m-by-n matrix A with leading dimension ldda. On exit, R stored in upper triangle, elementary reflectors below diagonal

  • ldda[in] Leading dimension. ldda >= max(1,m)

  • dtau[out] Device pointer to array of length min(m,n) containing scalar factors of elementary reflectors

  • dev_work[in] Device workspace pointer

  • dev_work_size[in] Size of device workspace in bytes

  • host_work[in] Host workspace pointer

  • host_work_size[in] Size of host workspace in bytes

  • dev_info[out] Device pointer to info status (0 = successful)

  • queue[in] GPU device queue for asynchronous execution
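QR follows the same query-then-factor pattern, with the additional dtau array for the reflector scalar factors. A hedged sketch, again assuming the BLAS++ device helpers shown earlier:

```cpp
#include <blas.hh>
#include <lapack.hh>
#include <algorithm>
#include <cstdint>
#include <vector>

int main() {
    int64_t m = 6, n = 4, ldda = m;
    std::vector<double> A( m*n );
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = 1.0 / (1.0 + i);  // arbitrary test data

    lapack::Queue queue( 0 );
    double* dA   = blas::device_malloc<double>( ldda*n, queue );
    double* dtau = blas::device_malloc<double>( std::min( m, n ), queue );
    blas::device_copy_matrix( m, n, A.data(), m, dA, ldda, queue );

    // Query workspace sizes, then allocate and factor.
    size_t dev_size, host_size;
    lapack::geqrf_work_size_bytes( m, n, dA, ldda, &dev_size, &host_size, queue );

    char* dev_work = blas::device_malloc<char>( dev_size, queue );
    std::vector<char> host_work( host_size );
    device_info_int* dev_info = blas::device_malloc<device_info_int>( 1, queue );

    lapack::geqrf( m, n, dA, ldda, dtau, dev_work, dev_size,
                   host_work.data(), host_size, dev_info, queue );
    queue.sync();  // dA holds R and the Householder reflectors; dtau their scalar factors

    blas::device_free( dA, queue );
    blas::device_free( dtau, queue );
    blas::device_free( dev_work, queue );
    blas::device_free( dev_info, queue );
    return 0;
}
```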

Eigenvalue Decomposition

template<typename scalar_t>
void lapack::heevd_work_size_bytes(lapack::Job jobz, lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, blas::real_type<scalar_t> *dW, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue)

Query workspace sizes for eigenvalue decomposition on GPU.

Parameters:
  • jobz[in]

    • Job::NoVec: Compute eigenvalues only

    • Job::Vec: Compute eigenvalues and eigenvectors

  • uplo[in] Triangle stored in A

  • n[in] Matrix dimension. n >= 0

  • dA[in] Device pointer to n-by-n matrix (unused, for overloading)

  • ldda[in] Leading dimension. ldda >= max(1,n)

  • dW[in] Device pointer to eigenvalue array (unused, for overloading)

  • dev_work_size[out] Size in bytes for device workspace

  • host_work_size[out] Size in bytes for host workspace

  • queue[in] GPU device queue

template<typename scalar_t>
void lapack::heevd(lapack::Job jobz, lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, blas::real_type<scalar_t> *dW, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue)

Eigenvalue decomposition of Hermitian matrix using divide-and-conquer on GPU.

Computes all eigenvalues and optionally eigenvectors of an n-by-n Hermitian matrix A. Uses divide-and-conquer algorithm for improved performance.

Template Parameters:

scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>

Parameters:
  • jobz[in]

    • Job::NoVec: Compute eigenvalues only

    • Job::Vec: Compute eigenvalues and eigenvectors

  • uplo[in]

    • Uplo::Upper: Upper triangle of A is stored

    • Uplo::Lower: Lower triangle of A is stored

  • n[in] Matrix dimension. n >= 0

  • dA[inout] Device pointer to n-by-n Hermitian matrix A with leading dimension ldda. On exit, if jobz = Vec, the columns of A contain the orthonormal eigenvectors

  • ldda[in] Leading dimension. ldda >= max(1,n)

  • dW[out] Device pointer to array of length n containing eigenvalues in ascending order

  • dev_work[in] Device workspace pointer

  • dev_work_size[in] Size of device workspace in bytes

  • host_work[in] Host workspace pointer

  • host_work_size[in] Size of host workspace in bytes

  • dev_info[out] Device pointer to info status:

    • 0: successful

    • i > 0: Algorithm failed to converge

  • queue[in] GPU device queue for asynchronous execution
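For a real scalar type the eigenvalue array dW has the same type as the matrix; for complex types it is the corresponding real type (blas::real_type<scalar_t>). A sketch computing eigenvalues and eigenvectors of a small symmetric matrix, under the same BLAS++ helper assumptions as the earlier examples:

```cpp
#include <blas.hh>
#include <lapack.hh>
#include <cstdint>
#include <vector>

int main() {
    int64_t n = 3, ldda = n;
    // Symmetric 3x3 matrix, column-major; only the lower triangle is referenced.
    std::vector<double> A = { 2, 1, 0,
                              1, 2, 1,
                              0, 1, 2 };

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda*n, queue );
    double* dW = blas::device_malloc<double>( n, queue );  // eigenvalues (real)
    blas::device_copy_matrix( n, n, A.data(), n, dA, ldda, queue );

    // Query workspace sizes, then allocate and decompose.
    size_t dev_size, host_size;
    lapack::heevd_work_size_bytes( lapack::Job::Vec, lapack::Uplo::Lower,
                                   n, dA, ldda, dW, &dev_size, &host_size, queue );

    char* dev_work = blas::device_malloc<char>( dev_size, queue );
    std::vector<char> host_work( host_size );
    device_info_int* dev_info = blas::device_malloc<device_info_int>( 1, queue );

    lapack::heevd( lapack::Job::Vec, lapack::Uplo::Lower, n, dA, ldda, dW,
                   dev_work, dev_size, host_work.data(), host_size,
                   dev_info, queue );

    std::vector<double> W( n );
    blas::device_memcpy( W.data(), dW, n, queue );
    queue.sync();  // W holds eigenvalues in ascending order; dA the eigenvectors

    blas::device_free( dA, queue );
    blas::device_free( dW, queue );
    blas::device_free( dev_work, queue );
    blas::device_free( dev_info, queue );
    return 0;
}
```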

Device Memory Types

LAPACK++ defines integer types matching vendor GPU libraries:

  • device_info_int: Return status from device operations (int for CUDA/ROCm, int64_t otherwise)

  • device_pivot_int: Pivot indices (int64_t for cuSOLVER 11+, int for ROCm, int64_t otherwise)

These types ensure ABI compatibility with vendor-specific libraries while maintaining a unified interface.