Device (GPU) Operations

GPU device management and asynchronous BLAS operations.

Queue Management

class Queue

Queue for executing GPU device routines.

This class provides a unified interface for GPU operations across different backends:

  • CUDA: Wraps cudaStream_t and cublasHandle_t

  • HIP/ROCm: Wraps hipStream_t and rocblas_handle

  • SYCL: Wraps sycl::queue

The Queue manages device memory workspace, stream/handle lifecycle, and supports fork/join parallelism for batch operations.
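A typical use of the Queue is to create it once, enqueue asynchronous device routines on it, and synchronize before reading results on the host. The following sketch assumes the BLAS++ device API (`blas::device_malloc`, `blas::device_memcpy`, `blas::device_free`, and the device overload of `blas::axpy`); exact header paths and signatures may differ by version:

```cpp
#include <blas.hh>   // BLAS++ (assumed install location)
#include <vector>

int main()
{
    // Queue on device 0; routines enqueued on it run asynchronously
    // with respect to the host.
    blas::Queue queue( 0 );

    int64_t n = 4;
    std::vector<float> a( n, 1.0f ), b( n, 2.0f );

    // Device memory associated with the queue.
    float* dev_a = blas::device_malloc<float>( n, queue );
    float* dev_b = blas::device_malloc<float>( n, queue );

    blas::device_memcpy( dev_a, a.data(), n, queue );
    blas::device_memcpy( dev_b, b.data(), n, queue );

    // Asynchronous axpy on the device: dev_b := 1.0*dev_a + dev_b.
    blas::axpy( n, 1.0f, dev_a, 1, dev_b, 1, queue );

    // Copy back, then block until everything on the queue completes.
    blas::device_memcpy( b.data(), dev_b, n, queue );
    queue.sync();

    blas::device_free( dev_a, queue );
    blas::device_free( dev_b, queue );
    return 0;
}
```

The same code runs unchanged on CUDA, HIP/ROCm, and SYCL backends, since the Queue hides the backend-specific stream and handle types.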

Subclassed by lapack::Queue

Public Types

using stream_t = void*

Public Functions

Queue()

Default constructor. Uses device 0.

Queue(int device)

Construct queue for specified device.

Parameters:

device[in] Device ID to use

Queue(int device, stream_t &stream)

Construct queue with specified device and stream.

Parameters:
  • device[in] Device ID to use

  • stream[in] Pre-existing stream to use
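On the CUDA backend, for example, an application that already owns a cudaStream_t can hand it to the Queue so that BLAS++ routines are serialized with the application's own kernels. A minimal sketch, assuming the CUDA backend and the constructor signature above (`stream_t` is `void*`, so the CUDA stream converts to it):

```cpp
#include <blas.hh>
#include <cuda_runtime.h>

// Sketch: wrap an application-owned CUDA stream in a blas::Queue.
void use_existing_stream( int device )
{
    cudaSetDevice( device );
    cudaStream_t app_stream;
    cudaStreamCreate( &app_stream );

    // stream_t is void*; the constructor takes it by reference,
    // so bind it to a named variable first.
    blas::Queue::stream_t stream = (blas::Queue::stream_t) app_stream;
    blas::Queue queue( device, stream );

    // ... enqueue BLAS++ routines; they execute on app_stream ...

    queue.sync();
    cudaStreamDestroy( app_stream );  // the app still owns the stream
}
```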

Queue(Queue const&) = delete
Queue &operator=(Queue const&) = delete
~Queue()

Destructor. Synchronizes and frees resources.

inline int device() const

Get device ID associated with this queue.

Returns:

Device ID

void sync()

Synchronize all operations on this queue.

Blocks until all queued operations complete.

inline void *work()

Get pointer to device workspace.

Returns:

Pointer to device workspace memory

template<typename scalar_t>
inline size_t work_size() const

Get size of device workspace.

Template Parameters:

scalar_t – Element type for size calculation

Returns:

Size of workspace in number of scalar_t elements

template<typename scalar_t>
void work_ensure_size(size_t lwork)

Ensure device workspace is at least specified size.

Ensures the GPU device workspace holds at least lwork elements of scalar_t, synchronizing and reallocating if needed. Allocates at least 3 * MaxBatchChunk * sizeof(void*) bytes, needed for batch gemm.

Parameters:

lwork[in] Minimum workspace size in scalar_t elements

Template Parameters:

scalar_t – Element type for size calculation
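Together, work_size(), work_ensure_size(), and work() let a routine reserve scratch space on the device before launching kernels that use it. A brief sketch of that pattern (the `n`-element requirement is illustrative):

```cpp
#include <blas.hh>

// Sketch: guarantee the queue's workspace can hold n doubles, then
// obtain the device pointer to it.
void prepare_workspace( blas::Queue& queue, size_t n )
{
    // work_size() reports the current capacity in elements of the
    // given type; work_ensure_size() synchronizes and reallocates
    // only if the workspace is too small.
    if (queue.work_size<double>() < n)
        queue.work_ensure_size<double>( n );

    double* W = (double*) queue.work();  // device pointer, n doubles usable
    // ... launch kernels that use W as scratch space ...
    (void) W;
}
```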

void fork(int num_streams = MaxForkSize)

Switch from default stream to parallel streams.

Enables concurrent kernel execution across multiple streams.

Parameters:

num_streams[in] Number of parallel streams (default: MaxForkSize)

void join()

Switch back to the default stream.

Synchronizes all parallel streams and returns to single-stream mode.

void revolve()

Rotate to the next stream in the queue.

Used with fork() to distribute work across parallel streams.
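The fork/revolve/join trio lets independent routines run concurrently on separate streams. A sketch of the pattern, assuming `dA`, `dB`, `dC` are arrays of device pointers to n-by-n column-major matrices:

```cpp
#include <blas.hh>

// Sketch: distribute independent gemms across parallel streams.
void concurrent_gemms( blas::Queue& queue, int batch,
                       double** dA, double** dB, double** dC, int64_t n )
{
    queue.fork();  // switch to parallel streams (up to MaxForkSize)
    for (int i = 0; i < batch; ++i) {
        // C[i] := A[i]*B[i] + C[i], issued on the current stream.
        blas::gemm( blas::Layout::ColMajor,
                    blas::Op::NoTrans, blas::Op::NoTrans,
                    n, n, n,
                    1.0, dA[i], n, dB[i], n,
                    1.0, dC[i], n, queue );
        queue.revolve();  // the next gemm goes to the next stream
    }
    queue.join();  // sync parallel streams, return to the default stream
}
```

This is most beneficial when each individual operation is too small to saturate the GPU on its own.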

void set_stream(stream_t &in_stream)

Set the stream for this queue.

Parameters:

in_stream[in] Stream to use

inline stream_t &stream()

Get the current stream.

Returns:

Reference to current stream (may be parallel stream in fork mode)

Device Memory and Batch Operations

See blaspp/include/blas/device.hh and blaspp/include/blas/batch_common.hh for:

Device Memory Functions:

  • device_malloc() - Allocate device memory

  • device_free() - Free device memory

  • device_memcpy() - Copy between host and device

  • device_malloc_pinned() - Allocate pinned host memory

  • device_free_pinned() - Free pinned host memory

  • get_device_count() - Query number of GPU devices
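A sketch of these memory functions in combination, assuming the queue-taking overloads (pinned host memory is page-locked, so host-device transfers from it can overlap with computation):

```cpp
#include <blas.hh>
#include <cstdio>

// Sketch: enumerate devices, then stage a transfer through pinned
// host memory.
int main()
{
    int ndev = blas::get_device_count();
    std::printf( "found %d GPU device(s)\n", ndev );
    if (ndev == 0) return 0;

    blas::Queue queue( 0 );
    int64_t n = 1024;

    // Pinned (page-locked) host buffer and a device buffer.
    double* host = blas::device_malloc_pinned<double>( n, queue );
    double* dev  = blas::device_malloc<double>( n, queue );

    for (int64_t i = 0; i < n; ++i)
        host[i] = double( i );

    blas::device_memcpy( dev, host, n, queue );
    queue.sync();

    blas::device_free( dev, queue );
    blas::device_free_pinned( host, queue );
    return 0;
}
```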

Batch Operation Validators:

  • gemm_check() - Validate batch GEMM parameters

  • trsm_check() - Validate batch TRSM parameters

  • trmm_check() - Validate batch TRMM parameters

  • hemm_check() - Validate batch HEMM parameters

  • herk_check() - Validate batch HERK parameters

  • symm_check() - Validate batch SYMM parameters

  • syrk_check() - Validate batch SYRK parameters

  • her2k_check() - Validate batch HER2K parameters

  • syr2k_check() - Validate batch SYR2K parameters

All batch operations execute asynchronously on device queues and support CUDA, ROCm/HIP, and SYCL backends.
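For illustration, a hedged sketch of a batched gemm, assuming the `blas::batch::gemm` interface in which arguments are passed as vectors (a size-1 vector is broadcast across the whole batch) and per-problem validation results land in `info`; the exact argument order may differ by version:

```cpp
#include <blas.hh>
#include <vector>

// Sketch: C[i] := A[i]*B[i] for a batch of n-by-n device matrices.
// dA, dB, dC each hold `batch` device pointers.
void batched_gemm( blas::Queue& queue, int64_t n, size_t batch,
                   std::vector<double*>& dA,
                   std::vector<double*>& dB,
                   std::vector<double*>& dC )
{
    // Size-1 vectors broadcast the same value to every problem.
    std::vector<blas::Op> trans { blas::Op::NoTrans };
    std::vector<int64_t>  dim   { n };
    std::vector<double>   alpha { 1.0 }, beta { 0.0 };
    std::vector<int64_t>  ld    { n };
    std::vector<int64_t>  info( batch );

    // The validator (gemm_check) runs on the arguments, then the whole
    // batch is launched asynchronously on the queue.
    blas::batch::gemm( blas::Layout::ColMajor, trans, trans,
                       dim, dim, dim,
                       alpha, dA, ld, dB, ld,
                       beta,  dC, ld,
                       batch, info, queue );
    queue.sync();
}
```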