Device (GPU) Operations
GPU device management and asynchronous LAPACK operations using cuSOLVER, rocSOLVER, or SYCL.
Queue Management
-
class Queue : public blas::Queue
GPU device queue for asynchronous LAPACK operations.
Extends blas::Queue with cuSOLVER/rocSOLVER handle management. Provides asynchronous execution of LAPACK routines on GPU devices, including factorizations (potrf, getrf, geqrf), triangular solves (trsm), and eigenvalue/SVD computations.
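As a hedged sketch of typical usage (the header name and the one-argument Queue constructor are assumptions that may differ between blaspp/lapackpp versions), a queue is created per device and synchronized after asynchronous calls:

```cpp
// Sketch: create a queue on device 0 and synchronize it.
// Requires lapackpp built with GPU (CUDA/ROCm/SYCL) support;
// header name and constructor signature may vary by version.
#include <lapack.hh>

int main() {
    int device = 0;                 // first visible GPU
    lapack::Queue queue( device );  // wraps a stream plus cuSOLVER/rocSOLVER handles
    // ... enqueue asynchronous LAPACK calls here ...
    queue.sync();                   // block until all queued work completes
    return 0;
}
```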
GPU Operations
Cholesky Factorization
-
template<typename scalar_t>
void lapack::potrf(lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, device_info_int *dev_info, lapack::Queue &queue) Cholesky factorization of Hermitian positive definite matrix on GPU.
Computes the Cholesky factorization of an n-by-n Hermitian positive definite matrix A on device memory:
If uplo = Upper: \( A = U^H U \)
If uplo = Lower: \( A = L L^H \)
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
uplo – [in] Whether upper or lower triangle of A is stored
n – [in] Matrix dimension. n >= 0
dA – [inout] Device pointer to n-by-n matrix A with leading dimension ldda. On exit, the factor U or L is stored in the respective triangle
ldda – [in] Leading dimension of dA. ldda >= max(1,n)
dev_info – [out] Device pointer to info status:
0: successful
i > 0: the leading minor of order i is not positive definite; the factorization could not be completed
queue – [in] GPU device queue for asynchronous execution
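A minimal sketch of a device Cholesky call, assuming lapackpp built with GPU support. The blas::device_malloc / device_memcpy / device_free helpers and the lapack::device_info_int spelling are taken from blaspp/lapackpp and may differ by version:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <vector>
#include <cstdint>

int main() {
    int64_t n = 3, ldda = 3;
    // Column-major SPD matrix A = [[4,1,1],[1,3,0],[1,0,2]]
    std::vector<double> A = { 4, 1, 1,   1, 3, 0,   1, 0, 2 };

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );

    // device_memcpy counts elements, per blaspp (an assumption to verify)
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    // Asynchronous: returns immediately; dA holds L only after sync.
    lapack::potrf( lapack::Uplo::Lower, n, dA, ldda, dinfo, queue );
    queue.sync();

    blas::device_free( dA, queue );
    blas::device_free( dinfo, queue );
    return 0;
}
```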
LU Factorization
-
template<typename scalar_t>
void lapack::getrf_work_size_bytes(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue) Query workspace sizes for LU factorization on GPU.
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [in] Device pointer to m-by-n matrix (unused, for overloading)
ldda – [in] Leading dimension. ldda >= max(1,m)
dev_work_size – [out] Size in bytes for device workspace
host_work_size – [out] Size in bytes for host workspace
queue – [in] GPU device queue
-
template<typename scalar_t>
void lapack::getrf(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, device_pivot_int *dev_ipiv, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue) LU factorization with partial pivoting on GPU.
Computes the LU factorization with row interchanges of an m-by-n matrix A: \( A = P L U \) where P is a permutation matrix, L is lower triangular with unit diagonal, and U is upper triangular.
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [inout] Device pointer to m-by-n matrix A with leading dimension ldda. On exit, the factors L and U stored in the lower/upper triangles; the unit diagonal of L is not stored
ldda – [in] Leading dimension. ldda >= max(1,m)
dev_ipiv – [out] Device pointer to pivot indices array of length min(m,n)
dev_work – [in] Device workspace pointer
dev_work_size – [in] Size of device workspace in bytes
host_work – [in] Host workspace pointer
host_work_size – [in] Size of host workspace in bytes
dev_info – [out] Device pointer to info status:
0: successful
i > 0: U(i,i) is exactly zero; the factorization has been completed, but U is singular
queue – [in] GPU device queue for asynchronous execution
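The two-step pattern (query workspace sizes, then factor) can be sketched as follows. This is a hedged example: the blas::device_malloc / device_memcpy / device_free helpers and the lapack::device_pivot_int / device_info_int spellings are assumptions from blaspp/lapackpp that may differ by version:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <algorithm>
#include <vector>
#include <cstdint>

int main() {
    int64_t m = 4, n = 4, ldda = 4;
    std::vector<double> A( ldda * n );
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = 1.0 + (i % 5);          // arbitrary test data

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    lapack::device_pivot_int* dipiv =
        blas::device_malloc<lapack::device_pivot_int>( std::min( m, n ), queue );
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    // Step 1: query workspace sizes (dA is used only for overload resolution).
    size_t dev_size = 0, host_size = 0;
    lapack::getrf_work_size_bytes( m, n, dA, ldda, &dev_size, &host_size, queue );

    // Step 2: allocate workspaces and factor.
    void* dev_work = dev_size
        ? (void*) blas::device_malloc<char>( dev_size, queue ) : nullptr;
    std::vector<char> host_work( host_size );

    lapack::getrf( m, n, dA, ldda, dipiv,
                   dev_work, dev_size, host_work.data(), host_size,
                   dinfo, queue );
    queue.sync();   // L, U, and pivots on device are valid after this point

    if (dev_work) blas::device_free( (char*) dev_work, queue );
    blas::device_free( dinfo, queue );
    blas::device_free( dipiv, queue );
    blas::device_free( dA, queue );
    return 0;
}
```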
QR Factorization
-
template<typename scalar_t>
void lapack::geqrf_work_size_bytes(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue) Query workspace sizes for QR factorization on GPU.
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [in] Device pointer to m-by-n matrix (unused, for overloading)
ldda – [in] Leading dimension. ldda >= max(1,m)
dev_work_size – [out] Size in bytes for device workspace
host_work_size – [out] Size in bytes for host workspace
queue – [in] GPU device queue
-
template<typename scalar_t>
void lapack::geqrf(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, scalar_t *dtau, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue) QR factorization on GPU.
Computes the QR factorization of an m-by-n matrix A: \( A = Q R \) where Q is orthogonal/unitary and R is upper triangular.
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [inout] Device pointer to m-by-n matrix A with leading dimension ldda. On exit, R is stored in the upper triangle; the elementary reflectors representing Q are stored below the diagonal
ldda – [in] Leading dimension. ldda >= max(1,m)
dtau – [out] Device pointer to array of length min(m,n) containing scalar factors of elementary reflectors
dev_work – [in] Device workspace pointer
dev_work_size – [in] Size of device workspace in bytes
host_work – [in] Host workspace pointer
host_work_size – [in] Size of host workspace in bytes
dev_info – [out] Device pointer to info status (0 = successful)
queue – [in] GPU device queue for asynchronous execution
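The QR call follows the same query-then-factor pattern, with dtau receiving the min(m,n) reflector scalars. A hedged sketch, with the same caveats as above about the blaspp helper names and type spellings:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <algorithm>
#include <vector>
#include <cstdint>

int main() {
    int64_t m = 4, n = 3, ldda = 4;
    std::vector<double> A( ldda * n );
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = 1.0 + (i % 7);          // arbitrary test data

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    double* dtau = blas::device_malloc<double>( std::min( m, n ), queue );
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    size_t dev_size = 0, host_size = 0;
    lapack::geqrf_work_size_bytes( m, n, dA, ldda, &dev_size, &host_size, queue );

    void* dev_work = dev_size
        ? (void*) blas::device_malloc<char>( dev_size, queue ) : nullptr;
    std::vector<char> host_work( host_size );

    lapack::geqrf( m, n, dA, ldda, dtau,
                   dev_work, dev_size, host_work.data(), host_size,
                   dinfo, queue );
    queue.sync();  // R in the upper triangle; reflectors + dtau define Q

    if (dev_work) blas::device_free( (char*) dev_work, queue );
    blas::device_free( dinfo, queue );
    blas::device_free( dtau, queue );
    blas::device_free( dA, queue );
    return 0;
}
```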
Eigenvalue Decomposition
-
template<typename scalar_t>
void lapack::heevd_work_size_bytes(lapack::Job jobz, lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, blas::real_type<scalar_t> *dW, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue) Query workspace sizes for eigenvalue decomposition on GPU.
- Parameters:
jobz – [in]
Job::NoVec: Compute eigenvalues only
Job::Vec: Compute eigenvalues and eigenvectors
uplo – [in] Triangle stored in A
n – [in] Matrix dimension. n >= 0
dA – [in] Device pointer to n-by-n matrix (unused, for overloading)
ldda – [in] Leading dimension. ldda >= max(1,n)
dW – [in] Device pointer to eigenvalue array (unused, for overloading)
dev_work_size – [out] Size in bytes for device workspace
host_work_size – [out] Size in bytes for host workspace
queue – [in] GPU device queue
-
template<typename scalar_t>
void lapack::heevd(lapack::Job jobz, lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, blas::real_type<scalar_t> *dW, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue) Eigenvalue decomposition of Hermitian matrix using divide-and-conquer on GPU.
Computes all eigenvalues and optionally eigenvectors of an n-by-n Hermitian matrix A. Uses divide-and-conquer algorithm for improved performance.
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
jobz – [in]
Job::NoVec: Compute eigenvalues only
Job::Vec: Compute eigenvalues and eigenvectors
uplo – [in]
Uplo::Upper: Upper triangle of A is stored
Uplo::Lower: Lower triangle of A is stored
n – [in] Matrix dimension. n >= 0
dA – [inout] Device pointer to n-by-n Hermitian matrix A with leading dimension ldda. On exit, if jobz = Vec, columns contain orthonormal eigenvectors
ldda – [in] Leading dimension. ldda >= max(1,n)
dW – [out] Device pointer to array of length n containing eigenvalues in ascending order
dev_work – [in] Device workspace pointer
dev_work_size – [in] Size of device workspace in bytes
host_work – [in] Host workspace pointer
host_work_size – [in] Size of host workspace in bytes
dev_info – [out] Device pointer to info status:
0: successful
i > 0: Algorithm failed to converge
queue – [in] GPU device queue for asynchronous execution
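For the real case, Hermitian reduces to symmetric, so a double-precision sketch looks like the following. As with the other examples, the blaspp helper names and type spellings are assumptions that may differ by version:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <vector>
#include <cstdint>

int main() {
    int64_t n = 3, ldda = 3;
    // Column-major symmetric tridiagonal test matrix
    std::vector<double> A = { 2, 1, 0,   1, 2, 1,   0, 1, 2 };

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    double* dW = blas::device_malloc<double>( n, queue );   // eigenvalues
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    size_t dev_size = 0, host_size = 0;
    lapack::heevd_work_size_bytes( lapack::Job::Vec, lapack::Uplo::Lower,
                                   n, dA, ldda, dW,
                                   &dev_size, &host_size, queue );

    void* dev_work = dev_size
        ? (void*) blas::device_malloc<char>( dev_size, queue ) : nullptr;
    std::vector<char> host_work( host_size );

    lapack::heevd( lapack::Job::Vec, lapack::Uplo::Lower, n, dA, ldda, dW,
                   dev_work, dev_size, host_work.data(), host_size,
                   dinfo, queue );
    queue.sync();  // dW: eigenvalues ascending; dA columns: eigenvectors

    // ... copy dW back to the host with blas::device_memcpy as needed ...
    if (dev_work) blas::device_free( (char*) dev_work, queue );
    blas::device_free( dinfo, queue );
    blas::device_free( dW, queue );
    blas::device_free( dA, queue );
    return 0;
}
```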
Device Memory Types
LAPACK++ defines integer types matching vendor GPU libraries:
device_info_int: Return status from device operations (int for CUDA/ROCm, int64_t otherwise)
device_pivot_int: Pivot indices (int64_t for cuSOLVER 11+, int for ROCm, int64_t otherwise)
These types ensure ABI compatibility with vendor-specific libraries while maintaining a unified interface.
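Since every routine above writes its status to device memory, a common follow-up is copying it back to the host after the operation. A hedged sketch, assuming blaspp's element-counting device_memcpy:

```cpp
#include <lapack.hh>
#include <blas.hh>

// Returns the info status stored at dinfo (device memory) after
// asynchronous operations previously enqueued on `queue`.
lapack::device_info_int get_info( lapack::device_info_int* dinfo,
                                  lapack::Queue& queue )
{
    lapack::device_info_int info = 0;
    blas::device_memcpy( &info, dinfo, 1, queue );  // count is in elements
    queue.sync();   // ensure the copy (and the preceding operation) finished
    return info;    // 0 = success; >0 follows each routine's convention
}
```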