Device (GPU) Operations
GPU device management and asynchronous BLAS operations.
Queue Management
-
class Queue
Queue for executing GPU device routines.
This class provides a unified interface for GPU operations across different backends:
CUDA: Wraps cudaStream_t and cublasHandle_t
HIP/ROCm: Wraps hipStream_t and rocblas_handle
SYCL: Wraps sycl::queue
The Queue manages device memory workspace, stream/handle lifecycle, and supports fork/join parallelism for batch operations.
Subclassed by lapack::Queue
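A minimal sketch of typical Queue usage, assuming the blaspp header `blas.hh`, its device memory helpers, and the device GEMM overload that takes a Queue as its last argument (the sizes here are illustrative; running it requires a GPU and the blaspp library):

```cpp
#include <blas.hh>   // blaspp main header
#include <vector>

int main() {
    int64_t n = 256;
    blas::Queue queue( 0 );  // queue on device 0

    // Allocate device memory associated with the queue's device.
    double* dA = blas::device_malloc<double>( n*n, queue );
    double* dB = blas::device_malloc<double>( n*n, queue );
    double* dC = blas::device_malloc<double>( n*n, queue );

    std::vector<double> A( n*n, 1.0 ), B( n*n, 1.0 ), C( n*n, 0.0 );
    blas::device_memcpy( dA, A.data(), n*n, queue );
    blas::device_memcpy( dB, B.data(), n*n, queue );

    // GEMM executes asynchronously on the queue's stream.
    blas::gemm( blas::Layout::ColMajor,
                blas::Op::NoTrans, blas::Op::NoTrans,
                n, n, n,
                1.0, dA, n, dB, n,
                0.0, dC, n, queue );

    // Copy result back; sync() blocks until all queued work is done.
    blas::device_memcpy( C.data(), dC, n*n, queue );
    queue.sync();

    blas::device_free( dA, queue );
    blas::device_free( dB, queue );
    blas::device_free( dC, queue );
    return 0;
}
```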
Public Types
-
using stream_t = void*
Public Functions
-
Queue()
Default constructor. Uses device 0.
-
Queue(int device)
Construct queue for specified device.
- Parameters:
device – [in] Device ID to use
-
Queue(int device, stream_t &stream)
Construct queue with specified device and stream.
- Parameters:
device – [in] Device ID to use
stream – [in] Pre-existing stream to use
-
~Queue()
Destructor. Synchronizes and frees resources.
-
inline int device() const
Get device ID associated with this queue.
- Returns:
Device ID
-
void sync()
Synchronize all operations on this queue.
Blocks until all queued operations complete.
-
inline void *work()
Get pointer to device workspace.
- Returns:
Pointer to device workspace memory
-
template<typename scalar_t>
inline size_t work_size() const
Get size of device workspace.
- Template Parameters:
scalar_t – Element type for size calculation
- Returns:
Size of workspace in number of scalar_t elements
-
template<typename scalar_t>
void work_ensure_size(size_t lwork)
Ensure device workspace is at least specified size.
Ensures GPU device workspace is at least lwork elements of scalar_t, synchronizing and reallocating if needed.
Allocates at least 3 * MaxBatchChunk * sizeof(void*), needed for batch gemm.
- Parameters:
lwork – [in] Minimum workspace size in scalar_t elements
- Template Parameters:
scalar_t – Element type for size calculation
-
void fork(int num_streams = MaxForkSize)
Switch from default stream to parallel streams.
Enables concurrent kernel execution across multiple streams.
- Parameters:
num_streams – [in] Number of parallel streams (default: MaxForkSize)
-
void join()
Switch back to the default stream.
Synchronizes all parallel streams and returns to single-stream mode.
-
void revolve()
Rotate to the next stream in the queue.
Used with fork() to distribute work across parallel streams.
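The fork/revolve/join pattern can be sketched as follows (a hedged example: `scale_all`, `dX`, and the use of the device `blas::scal` overload are illustrative assumptions, not part of the Queue API itself):

```cpp
#include <blas.hh>

// Scale several independent device vectors concurrently.
// dX is an array of `batch` device pointers, each of length n
// (hypothetical names for illustration).
void scale_all( blas::Queue& queue, int batch,
                int64_t n, double alpha, double** dX )
{
    queue.fork();  // switch from the default stream to parallel streams
    for (int i = 0; i < batch; ++i) {
        // Each call is issued asynchronously on the queue's current stream.
        blas::scal( n, alpha, dX[i], 1, queue );
        queue.revolve();  // rotate to the next parallel stream
    }
    queue.join();  // sync all parallel streams; back to the default stream
}
```

Because the vectors are independent, distributing them across streams lets the kernels overlap; join() restores single-stream semantics before any dependent work is enqueued.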
Device Memory and Batch Operations
See blaspp/include/blas/device.hh and blaspp/include/blas/batch_common.hh for:
Device Memory Functions:
- device_malloc() - Allocate device memory
- device_free() - Free device memory
- device_memcpy() - Copy between host and device
- device_malloc_pinned() - Allocate pinned host memory
- device_free_pinned() - Free pinned host memory
- get_device_count() - Query number of GPU devices
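A short sketch of the memory helpers above, including pinned host memory (assuming the blaspp signatures in device.hh; exact signatures may differ between blaspp versions):

```cpp
#include <blas.hh>
#include <cstdio>

int main() {
    int num = blas::get_device_count();  // number of visible GPU devices
    std::printf( "found %d device(s)\n", num );
    if (num == 0)
        return 0;

    blas::Queue queue( 0 );
    int64_t n = 1024;

    // Pinned (page-locked) host memory speeds up host<->device copies.
    float* hX = blas::device_malloc_pinned<float>( n, queue );
    float* dX = blas::device_malloc<float>( n, queue );

    for (int64_t i = 0; i < n; ++i)
        hX[i] = float( i );

    blas::device_memcpy( dX, hX, n, queue );  // asynchronous copy
    queue.sync();                             // wait for the copy to finish

    blas::device_free( dX, queue );
    blas::device_free_pinned( hX, queue );
    return 0;
}
```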
Batch Operation Validators:
- gemm_check() - Validate batch GEMM parameters
- trsm_check() - Validate batch TRSM parameters
- trmm_check() - Validate batch TRMM parameters
- hemm_check() - Validate batch HEMM parameters
- herk_check() - Validate batch HERK parameters
- symm_check() - Validate batch SYMM parameters
- syrk_check() - Validate batch SYRK parameters
- her2k_check() - Validate batch HER2K parameters
- syr2k_check() - Validate batch SYR2K parameters
All batch operations execute asynchronously on device queues and support CUDA, ROCm/HIP, and SYCL backends.
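A batched GEMM might be invoked as sketched below, assuming the blaspp batch interface in which scalar and dimension arguments are passed as std::vector (a length-1 vector applies one value to every batch entry); the wrapper function and pointer-array names are illustrative:

```cpp
#include <blas.hh>
#include <vector>

// Sketch of a batched device GEMM. dA, dB, dC hold `batch` device
// pointers to n-by-n matrices, set up elsewhere (hypothetical names).
void batch_gemm_sketch( blas::Queue& queue, int64_t batch, int64_t n,
                        std::vector<double*>& dA,
                        std::vector<double*>& dB,
                        std::vector<double*>& dC )
{
    // Length-1 vectors mean "same value for every batch entry".
    std::vector<blas::Op> trans{ blas::Op::NoTrans };
    std::vector<int64_t>  nn{ n }, ld{ n };
    std::vector<double>   alpha{ 1.0 }, beta{ 0.0 };
    std::vector<int64_t>  info( batch );  // per-entry error codes

    // Parameters are validated (gemm_check), then all GEMMs are
    // launched asynchronously on the queue.
    blas::batch::gemm( blas::Layout::ColMajor, trans, trans,
                       nn, nn, nn,
                       alpha, dA, ld, dB, ld,
                       beta,  dC, ld,
                       batch, info, queue );
    queue.sync();  // block until every GEMM in the batch completes
}
```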