Device (GPU) Operations
GPU device management and asynchronous LAPACK operations using cuSOLVER, rocSOLVER, or SYCL.
Queue Management
-
class Queue : public blas::Queue
GPU device queue for asynchronous LAPACK operations.
Extends blas::Queue with cuSOLVER/rocSOLVER handle management. Provides asynchronous execution of LAPACK routines on GPU devices, including factorizations (potrf, getrf, geqrf), triangular solves (trsm), and eigenvalue/SVD computations.
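As a hedged sketch of typical usage (the header name and the one-argument Queue constructor are assumptions that may differ between blaspp/lapackpp versions), a queue is created per device and synchronized after asynchronous calls:

```cpp
// Sketch: create a queue on device 0 and synchronize it.
// Requires lapackpp built with GPU (CUDA/ROCm/SYCL) support;
// header name and constructor signature may vary by version.
#include <lapack.hh>

int main() {
    int device = 0;                 // first visible GPU
    lapack::Queue queue( device );  // wraps a stream plus cuSOLVER/rocSOLVER handles
    // ... enqueue asynchronous LAPACK calls here ...
    queue.sync();                   // block until all queued work completes
    return 0;
}
```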
GPU Operations
Cholesky Factorization
-
template<typename scalar_t>
void lapack::potrf(lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, device_info_int *dev_info, lapack::Queue &queue) Cholesky factorization of Hermitian positive definite matrix on GPU.
Computes the Cholesky factorization of an n-by-n Hermitian positive definite matrix A on device memory:
If uplo = Upper: \( A = U^H U \)
If uplo = Lower: \( A = L L^H \)
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
uplo – [in] Whether upper or lower triangle of A is stored
n – [in] Matrix dimension. n >= 0
dA – [inout] Device pointer to n-by-n matrix A with leading dimension ldda. On exit, the factor U or L is stored in the respective triangle
ldda – [in] Leading dimension of dA. ldda >= max(1,n)
dev_info – [out] Device pointer to info status:
0: successful
i > 0: the leading minor of order i is not positive definite; the factorization could not be completed
queue – [in] GPU device queue for asynchronous execution
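A minimal sketch of a device Cholesky call, assuming lapackpp built with GPU support. The blas::device_malloc / device_memcpy / device_free helpers and the lapack::device_info_int spelling are taken from blaspp/lapackpp and may differ by version:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <vector>
#include <cstdint>

int main() {
    int64_t n = 3, ldda = 3;
    // Column-major SPD matrix A = [[4,1,1],[1,3,0],[1,0,2]]
    std::vector<double> A = { 4, 1, 1,   1, 3, 0,   1, 0, 2 };

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );

    // device_memcpy counts elements, per blaspp (an assumption to verify)
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    // Asynchronous: returns immediately; dA holds L only after sync.
    lapack::potrf( lapack::Uplo::Lower, n, dA, ldda, dinfo, queue );
    queue.sync();

    blas::device_free( dA, queue );
    blas::device_free( dinfo, queue );
    return 0;
}
```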
LU Factorization
-
template<typename scalar_t>
void lapack::getrf_work_size_bytes(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue) Query workspace sizes for LU factorization on GPU.
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [in] Device pointer to m-by-n matrix (unused, for overloading)
ldda – [in] Leading dimension. ldda >= max(1,m)
dev_work_size – [out] Size in bytes for device workspace
host_work_size – [out] Size in bytes for host workspace
queue – [in] GPU device queue
-
template<typename scalar_t>
void lapack::getrf(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, device_pivot_int *dev_ipiv, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue) LU factorization with partial pivoting on GPU.
Computes the LU factorization with row interchanges of an m-by-n matrix A: \( A = P L U \) where P is a permutation matrix, L is lower triangular with unit diagonal, and U is upper triangular.
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [inout] Device pointer to m-by-n matrix A with leading dimension ldda. On exit, the factors L and U stored in the lower/upper triangles; the unit diagonal of L is not stored
ldda – [in] Leading dimension. ldda >= max(1,m)
dev_ipiv – [out] Device pointer to pivot indices array of length min(m,n)
dev_work – [in] Device workspace pointer
dev_work_size – [in] Size of device workspace in bytes
host_work – [in] Host workspace pointer
host_work_size – [in] Size of host workspace in bytes
dev_info – [out] Device pointer to info status:
0: successful
i > 0: U(i,i) is exactly zero; the factorization has been completed, but U is singular
queue – [in] GPU device queue for asynchronous execution
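The two-step pattern (query workspace sizes, then factor) can be sketched as follows. This is a hedged example: the blas::device_malloc / device_memcpy / device_free helpers and the lapack::device_pivot_int / device_info_int spellings are assumptions from blaspp/lapackpp that may differ by version:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <algorithm>
#include <vector>
#include <cstdint>

int main() {
    int64_t m = 4, n = 4, ldda = 4;
    std::vector<double> A( ldda * n );
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = 1.0 + (i % 5);          // arbitrary test data

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    lapack::device_pivot_int* dipiv =
        blas::device_malloc<lapack::device_pivot_int>( std::min( m, n ), queue );
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    // Step 1: query workspace sizes (dA is used only for overload resolution).
    size_t dev_size = 0, host_size = 0;
    lapack::getrf_work_size_bytes( m, n, dA, ldda, &dev_size, &host_size, queue );

    // Step 2: allocate workspaces and factor.
    void* dev_work = dev_size
        ? (void*) blas::device_malloc<char>( dev_size, queue ) : nullptr;
    std::vector<char> host_work( host_size );

    lapack::getrf( m, n, dA, ldda, dipiv,
                   dev_work, dev_size, host_work.data(), host_size,
                   dinfo, queue );
    queue.sync();   // L, U, and pivots on device are valid after this point

    if (dev_work) blas::device_free( (char*) dev_work, queue );
    blas::device_free( dinfo, queue );
    blas::device_free( dipiv, queue );
    blas::device_free( dA, queue );
    return 0;
}
```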
QR Factorization
-
template<typename scalar_t>
void lapack::geqrf_work_size_bytes(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue) Query workspace sizes for QR factorization on GPU.
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [in] Device pointer to m-by-n matrix (unused, for overloading)
ldda – [in] Leading dimension. ldda >= max(1,m)
dev_work_size – [out] Size in bytes for device workspace
host_work_size – [out] Size in bytes for host workspace
queue – [in] GPU device queue
-
template<typename scalar_t>
void lapack::geqrf(int64_t m, int64_t n, scalar_t *dA, int64_t ldda, scalar_t *dtau, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue) QR factorization on GPU.
Computes the QR factorization of an m-by-n matrix A: \( A = Q R \) where Q is orthogonal/unitary and R is upper triangular.
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
m – [in] Number of rows. m >= 0
n – [in] Number of columns. n >= 0
dA – [inout] Device pointer to m-by-n matrix A with leading dimension ldda. On exit, R is stored in the upper triangle; the elementary reflectors representing Q are stored below the diagonal
ldda – [in] Leading dimension. ldda >= max(1,m)
dtau – [out] Device pointer to array of length min(m,n) containing scalar factors of elementary reflectors
dev_work – [in] Device workspace pointer
dev_work_size – [in] Size of device workspace in bytes
host_work – [in] Host workspace pointer
host_work_size – [in] Size of host workspace in bytes
dev_info – [out] Device pointer to info status (0 = successful)
queue – [in] GPU device queue for asynchronous execution
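The QR call follows the same query-then-factor pattern, with dtau receiving the min(m,n) reflector scalars. A hedged sketch, with the same caveats as above about the blaspp helper names and type spellings:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <algorithm>
#include <vector>
#include <cstdint>

int main() {
    int64_t m = 4, n = 3, ldda = 4;
    std::vector<double> A( ldda * n );
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = 1.0 + (i % 7);          // arbitrary test data

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    double* dtau = blas::device_malloc<double>( std::min( m, n ), queue );
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    size_t dev_size = 0, host_size = 0;
    lapack::geqrf_work_size_bytes( m, n, dA, ldda, &dev_size, &host_size, queue );

    void* dev_work = dev_size
        ? (void*) blas::device_malloc<char>( dev_size, queue ) : nullptr;
    std::vector<char> host_work( host_size );

    lapack::geqrf( m, n, dA, ldda, dtau,
                   dev_work, dev_size, host_work.data(), host_size,
                   dinfo, queue );
    queue.sync();  // R in the upper triangle; reflectors + dtau define Q

    if (dev_work) blas::device_free( (char*) dev_work, queue );
    blas::device_free( dinfo, queue );
    blas::device_free( dtau, queue );
    blas::device_free( dA, queue );
    return 0;
}
```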
Eigenvalue Decomposition
-
template<typename scalar_t>
void lapack::heevd_work_size_bytes(lapack::Job jobz, lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, blas::real_type<scalar_t> *dW, size_t *dev_work_size, size_t *host_work_size, lapack::Queue &queue) Query workspace sizes for eigenvalue decomposition on GPU.
- Parameters:
jobz – [in]
Job::NoVec: Compute eigenvalues only
Job::Vec: Compute eigenvalues and eigenvectors
uplo – [in] Triangle stored in A
n – [in] Matrix dimension. n >= 0
dA – [in] Device pointer to n-by-n matrix (unused, for overloading)
ldda – [in] Leading dimension. ldda >= max(1,n)
dW – [in] Device pointer to eigenvalue array (unused, for overloading)
dev_work_size – [out] Size in bytes for device workspace
host_work_size – [out] Size in bytes for host workspace
queue – [in] GPU device queue
-
template<typename scalar_t>
void lapack::heevd(lapack::Job jobz, lapack::Uplo uplo, int64_t n, scalar_t *dA, int64_t ldda, blas::real_type<scalar_t> *dW, void *dev_work, size_t dev_work_size, void *host_work, size_t host_work_size, device_info_int *dev_info, lapack::Queue &queue) Eigenvalue decomposition of Hermitian matrix using divide-and-conquer on GPU.
Computes all eigenvalues and optionally eigenvectors of an n-by-n Hermitian matrix A. Uses divide-and-conquer algorithm for improved performance.
- Template Parameters:
scalar_t – Matrix element type: float, double, std::complex<float>, std::complex<double>
- Parameters:
jobz – [in]
Job::NoVec: Compute eigenvalues only
Job::Vec: Compute eigenvalues and eigenvectors
uplo – [in]
Uplo::Upper: Upper triangle of A is stored
Uplo::Lower: Lower triangle of A is stored
n – [in] Matrix dimension. n >= 0
dA – [inout] Device pointer to n-by-n Hermitian matrix A with leading dimension ldda. On exit, if jobz = Vec, columns contain orthonormal eigenvectors
ldda – [in] Leading dimension. ldda >= max(1,n)
dW – [out] Device pointer to array of length n containing eigenvalues in ascending order
dev_work – [in] Device workspace pointer
dev_work_size – [in] Size of device workspace in bytes
host_work – [in] Host workspace pointer
host_work_size – [in] Size of host workspace in bytes
dev_info – [out] Device pointer to info status:
0: successful
i > 0: Algorithm failed to converge
queue – [in] GPU device queue for asynchronous execution
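For the real case, Hermitian reduces to symmetric, so a double-precision sketch looks like the following. As with the other examples, the blaspp helper names and type spellings are assumptions that may differ by version:

```cpp
#include <lapack.hh>
#include <blas.hh>
#include <vector>
#include <cstdint>

int main() {
    int64_t n = 3, ldda = 3;
    // Column-major symmetric tridiagonal test matrix
    std::vector<double> A = { 2, 1, 0,   1, 2, 1,   0, 1, 2 };

    lapack::Queue queue( 0 );
    double* dA = blas::device_malloc<double>( ldda * n, queue );
    double* dW = blas::device_malloc<double>( n, queue );   // eigenvalues
    lapack::device_info_int* dinfo =
        blas::device_malloc<lapack::device_info_int>( 1, queue );
    blas::device_memcpy( dA, A.data(), ldda * n, queue );

    size_t dev_size = 0, host_size = 0;
    lapack::heevd_work_size_bytes( lapack::Job::Vec, lapack::Uplo::Lower,
                                   n, dA, ldda, dW,
                                   &dev_size, &host_size, queue );

    void* dev_work = dev_size
        ? (void*) blas::device_malloc<char>( dev_size, queue ) : nullptr;
    std::vector<char> host_work( host_size );

    lapack::heevd( lapack::Job::Vec, lapack::Uplo::Lower, n, dA, ldda, dW,
                   dev_work, dev_size, host_work.data(), host_size,
                   dinfo, queue );
    queue.sync();  // dW: eigenvalues ascending; dA columns: eigenvectors

    // ... copy dW back to the host with blas::device_memcpy as needed ...
    if (dev_work) blas::device_free( (char*) dev_work, queue );
    blas::device_free( dinfo, queue );
    blas::device_free( dW, queue );
    blas::device_free( dA, queue );
    return 0;
}
```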
Device Memory Types
LAPACK++ defines integer types matching vendor GPU libraries:
device_info_int: Return status from device operations (int for CUDA/ROCm, int64_t otherwise)
device_pivot_int: Pivot indices (int64_t for cuSOLVER 11+, int for ROCm, int64_t otherwise)
These types ensure ABI compatibility with vendor-specific libraries while maintaining a unified interface.
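Since every routine above writes its status to device memory, a common follow-up is copying it back to the host after the operation. A hedged sketch, assuming blaspp's element-counting device_memcpy:

```cpp
#include <lapack.hh>
#include <blas.hh>

// Returns the info status stored at dinfo (device memory) after
// asynchronous operations previously enqueued on `queue`.
lapack::device_info_int get_info( lapack::device_info_int* dinfo,
                                  lapack::Queue& queue )
{
    lapack::device_info_int info = 0;
    blas::device_memcpy( &info, dinfo, 1, queue );  // count is in elements
    queue.sync();   // ensure the copy (and the preceding operation) finished
    return info;    // 0 = success; >0 follows each routine's convention
}
```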