Device (GPU) Operations

GPU device management and asynchronous BLAS operations.

Queue Management

class Queue

Queue for executing GPU device routines.

This class provides a unified interface for GPU operations across different backends:

  • CUDA: Wraps cudaStream_t and cublasHandle_t

  • HIP/ROCm: Wraps hipStream_t and rocblas_handle

  • SYCL: Wraps sycl::queue

The Queue manages device memory workspace, stream/handle lifecycle, and supports fork/join parallelism for batch operations.
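A typical use of the Queue is to create it once, enqueue asynchronous device routines on it, and synchronize before reading results on the host. The following sketch assumes the BLAS++ device API (`blas::device_malloc`, `blas::device_memcpy`, `blas::device_free`, and the device overload of `blas::axpy`); exact header paths and signatures may differ by version:

```cpp
#include <blas.hh>   // BLAS++ (assumed install location)
#include <vector>

int main()
{
    // Queue on device 0; routines enqueued on it run asynchronously
    // with respect to the host.
    blas::Queue queue( 0 );

    int64_t n = 4;
    std::vector<float> a( n, 1.0f ), b( n, 2.0f );

    // Device memory associated with the queue.
    float* dev_a = blas::device_malloc<float>( n, queue );
    float* dev_b = blas::device_malloc<float>( n, queue );

    blas::device_memcpy( dev_a, a.data(), n, queue );
    blas::device_memcpy( dev_b, b.data(), n, queue );

    // Asynchronous axpy on the device: dev_b := 1.0*dev_a + dev_b.
    blas::axpy( n, 1.0f, dev_a, 1, dev_b, 1, queue );

    // Copy back, then block until everything on the queue completes.
    blas::device_memcpy( b.data(), dev_b, n, queue );
    queue.sync();

    blas::device_free( dev_a, queue );
    blas::device_free( dev_b, queue );
    return 0;
}
```

The same code runs unchanged on CUDA, HIP/ROCm, and SYCL backends, since the Queue hides the backend-specific stream and handle types.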

Subclassed by lapack::Queue

Public Types

using stream_t = void*

Public Functions

Queue()

Default constructor. Uses device 0.

Queue(int device)

Construct queue for specified device.

Parameters:

device[in] Device ID to use

Queue(int device, stream_t &stream)

Construct queue with specified device and stream.

Parameters:
  • device[in] Device ID to use

  • stream[in] Pre-existing stream to use
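On the CUDA backend, for example, an application that already owns a cudaStream_t can hand it to the Queue so that BLAS++ routines are serialized with the application's own kernels. A minimal sketch, assuming the CUDA backend and the constructor signature above (`stream_t` is `void*`, so the CUDA stream converts to it):

```cpp
#include <blas.hh>
#include <cuda_runtime.h>

// Sketch: wrap an application-owned CUDA stream in a blas::Queue.
void use_existing_stream( int device )
{
    cudaSetDevice( device );
    cudaStream_t app_stream;
    cudaStreamCreate( &app_stream );

    // stream_t is void*; the constructor takes it by reference,
    // so bind it to a named variable first.
    blas::Queue::stream_t stream = (blas::Queue::stream_t) app_stream;
    blas::Queue queue( device, stream );

    // ... enqueue BLAS++ routines; they execute on app_stream ...

    queue.sync();
    cudaStreamDestroy( app_stream );  // the app still owns the stream
}
```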

Queue(Queue const&) = delete
Queue &operator=(Queue const&) = delete
~Queue()

Destructor. Synchronizes and frees resources.

inline int device() const

Get device ID associated with this queue.

Returns:

Device ID

void sync()

Synchronize all operations on this queue.

Blocks until all queued operations complete.

inline void *work()

Get pointer to device workspace.

Returns:

Pointer to device workspace memory

template<typename scalar_t>
inline size_t work_size() const

Get size of device workspace.

Template Parameters:

scalar_t – Element type for size calculation

Returns:

Size of workspace in number of scalar_t elements

template<typename scalar_t>
void work_ensure_size(size_t lwork)

Ensure device workspace is at least specified size.

Ensures the GPU device workspace holds at least lwork elements of scalar_t, synchronizing and reallocating if needed. Allocates at least 3 * MaxBatchChunk * sizeof(void*) bytes, needed for batch gemm.

Parameters:

lwork[in] Minimum workspace size in scalar_t elements

Template Parameters:

scalar_t – Element type for size calculation
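Together, work_size(), work_ensure_size(), and work() let a routine reserve scratch space on the device before launching kernels that use it. A brief sketch of that pattern (the `n`-element requirement is illustrative):

```cpp
#include <blas.hh>

// Sketch: guarantee the queue's workspace can hold n doubles, then
// obtain the device pointer to it.
void prepare_workspace( blas::Queue& queue, size_t n )
{
    // work_size() reports the current capacity in elements of the
    // given type; work_ensure_size() synchronizes and reallocates
    // only if the workspace is too small.
    if (queue.work_size<double>() < n)
        queue.work_ensure_size<double>( n );

    double* W = (double*) queue.work();  // device pointer, n doubles usable
    // ... launch kernels that use W as scratch space ...
    (void) W;
}
```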

void fork(int num_streams = MaxForkSize)

Switch from default stream to parallel streams.

Enables concurrent kernel execution across multiple streams.

Parameters:

num_streams[in] Number of parallel streams (default: MaxForkSize)

void join()

Switch back to the default stream.

Synchronizes all parallel streams and returns to single-stream mode.

void revolve()

Rotate to the next stream in the queue.

Used with fork() to distribute work across parallel streams.
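The fork/revolve/join trio lets independent routines run concurrently on separate streams. A sketch of the pattern, assuming `dA`, `dB`, `dC` are arrays of device pointers to n-by-n column-major matrices:

```cpp
#include <blas.hh>

// Sketch: distribute independent gemms across parallel streams.
void concurrent_gemms( blas::Queue& queue, int batch,
                       double** dA, double** dB, double** dC, int64_t n )
{
    queue.fork();  // switch to parallel streams (up to MaxForkSize)
    for (int i = 0; i < batch; ++i) {
        // C[i] := A[i]*B[i] + C[i], issued on the current stream.
        blas::gemm( blas::Layout::ColMajor,
                    blas::Op::NoTrans, blas::Op::NoTrans,
                    n, n, n,
                    1.0, dA[i], n, dB[i], n,
                    1.0, dC[i], n, queue );
        queue.revolve();  // the next gemm goes to the next stream
    }
    queue.join();  // sync parallel streams, return to the default stream
}
```

This is most beneficial when each individual operation is too small to saturate the GPU on its own.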

void set_stream(stream_t &in_stream)

Set the stream for this queue.

Parameters:

in_stream[in] Stream to use

inline stream_t &stream()

Get the current stream.

Returns:

Reference to current stream (may be parallel stream in fork mode)

Device Memory and Batch Operations

See blaspp/include/blas/device.hh and blaspp/include/blas/batch_common.hh for:

Device Memory Functions:

  • device_malloc() - Allocate device memory

  • device_free() - Free device memory

  • device_memcpy() - Copy between host and device

  • device_malloc_pinned() - Allocate pinned host memory

  • device_free_pinned() - Free pinned host memory

  • get_device_count() - Query number of GPU devices
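A sketch of these memory functions in combination, assuming the queue-taking overloads (pinned host memory is page-locked, so host-device transfers from it can overlap with computation):

```cpp
#include <blas.hh>
#include <cstdio>

// Sketch: enumerate devices, then stage a transfer through pinned
// host memory.
int main()
{
    int ndev = blas::get_device_count();
    std::printf( "found %d GPU device(s)\n", ndev );
    if (ndev == 0) return 0;

    blas::Queue queue( 0 );
    int64_t n = 1024;

    // Pinned (page-locked) host buffer and a device buffer.
    double* host = blas::device_malloc_pinned<double>( n, queue );
    double* dev  = blas::device_malloc<double>( n, queue );

    for (int64_t i = 0; i < n; ++i)
        host[i] = double( i );

    blas::device_memcpy( dev, host, n, queue );
    queue.sync();

    blas::device_free( dev, queue );
    blas::device_free_pinned( host, queue );
    return 0;
}
```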

Batch Operation Validators:

  • gemm_check() - Validate batch GEMM parameters

  • trsm_check() - Validate batch TRSM parameters

  • trmm_check() - Validate batch TRMM parameters

  • hemm_check() - Validate batch HEMM parameters

  • herk_check() - Validate batch HERK parameters

  • symm_check() - Validate batch SYMM parameters

  • syrk_check() - Validate batch SYRK parameters

  • her2k_check() - Validate batch HER2K parameters

  • syr2k_check() - Validate batch SYR2K parameters

All batch operations execute asynchronously on device queues and support CUDA, ROCm/HIP, and SYCL backends.
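For illustration, a hedged sketch of a batched gemm, assuming the `blas::batch::gemm` interface in which arguments are passed as vectors (a size-1 vector is broadcast across the whole batch) and per-problem validation results land in `info`; the exact argument order may differ by version:

```cpp
#include <blas.hh>
#include <vector>

// Sketch: C[i] := A[i]*B[i] for a batch of n-by-n device matrices.
// dA, dB, dC each hold `batch` device pointers.
void batched_gemm( blas::Queue& queue, int64_t n, size_t batch,
                   std::vector<double*>& dA,
                   std::vector<double*>& dB,
                   std::vector<double*>& dC )
{
    // Size-1 vectors broadcast the same value to every problem.
    std::vector<blas::Op> trans { blas::Op::NoTrans };
    std::vector<int64_t>  dim   { n };
    std::vector<double>   alpha { 1.0 }, beta { 0.0 };
    std::vector<int64_t>  ld    { n };
    std::vector<int64_t>  info( batch );

    // The validator (gemm_check) runs on the arguments, then the whole
    // batch is launched asynchronously on the queue.
    blas::batch::gemm( blas::Layout::ColMajor, trans, trans,
                       dim, dim, dim,
                       alpha, dA, ld, dB, ld,
                       beta,  dC, ld,
                       batch, info, queue );
    queue.sync();
}
```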