Performance Counting

PAPI performance counters and FLOP/bandwidth calculations.

PAPI Counter Integration

class counter

Performance counter integration for BLAS++.

This class integrates BLAS++ with PAPI (Performance API) to count BLAS operations and compute floating-point operation counts. It uses the Meyers singleton pattern (a function-local static instance) for thread-safe initialization.

The counter system tracks:

Number of calls to each BLAS routine
Dimensions and parameters for each call
Total floating-point operations performed

Usage (when PAPI is available):

// Insert operation into counting set
counter::gemm_type op = {transA, transB, m, n, k};
counter::insert(op, counter::Id::gemm);

// Get total flop count
long long flops = counter::get_flop_count(&atomic_var);

Note

This is essentially a namespace - all public functions are static.
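The combination of a Meyers singleton with an all-static public interface can be sketched as follows. This is a minimal illustration, not the BLAS++ implementation: `CallLogSketch`, its members, and `call_log_demo` are hypothetical names, and a plain `std::vector` stands in for the PAPI counting set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the counter class; every name here is illustrative.
class CallLogSketch {
public:
    struct axpy_type { std::int64_t n; };  // one-field parameter record

    // The public interface is static, so callers never touch the instance.
    static void insert(axpy_type op)
    {
        CallLogSketch& self = instance();
        std::lock_guard<std::mutex> lock(self.mutex_);
        self.calls_.push_back(op);
    }

    static std::size_t size()
    {
        CallLogSketch& self = instance();
        std::lock_guard<std::mutex> lock(self.mutex_);
        return self.calls_.size();
    }

private:
    CallLogSketch() = default;

    // Meyers singleton: since C++11, initialization of a function-local
    // static is thread-safe and happens on first use.
    static CallLogSketch& instance()
    {
        static CallLogSketch self;
        return self;
    }

    std::mutex mutex_;
    std::vector<axpy_type> calls_;
};

// Usage: record one call and confirm the log grew by one entry.
inline void call_log_demo()
{
    std::size_t before = CallLogSketch::size();
    CallLogSketch::insert({1000});
    assert(CallLogSketch::size() == before + 1);
}
```

The mutex guards the container after construction; the function-local static only guarantees that construction itself is race-free.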

Public Types

typedef void CountingSet

typedef void cset_list_object_t

typedef axpy_type scal_type

typedef axpy_type copy_type

typedef axpy_type swap_type

typedef axpy_type dot_type

typedef axpy_type dotu_type

typedef axpy_type nrm2_type

typedef axpy_type asum_type

typedef axpy_type iamax_type

typedef axpy_type rot_type

typedef axpy_type rotm_type

typedef axpy_type rotg_type

typedef axpy_type rotmg_type

typedef hemv_type symv_type

typedef hemv_type her_type

typedef hemv_type her2_type

typedef hemv_type syr_type

typedef hemv_type syr2_type

typedef trmv_type trsv_type

typedef ger_type geru_type

typedef ger_type gerc_type

typedef hemm_type symm_type

typedef herk_type syrk_type

typedef herk_type syr2k_type

typedef herk_type her2k_type

typedef trmm_type trsm_type

typedef axpy_type dev_axpy_type: Device axpy parameters.

typedef axpy_type dev_scal_type

typedef axpy_type dev_copy_type

typedef axpy_type dev_swap_type

typedef axpy_type dev_dot_type

typedef axpy_type dev_dotu_type

typedef axpy_type dev_nrm2_type

typedef axpy_type dev_asum_type

typedef axpy_type dev_iamax_type

typedef axpy_type dev_rot_type

typedef axpy_type dev_rotm_type

typedef axpy_type dev_rotg_type

typedef axpy_type dev_rotmg_type

typedef gemv_type dev_gemv_type

typedef hemv_type dev_hemv_type

typedef hemv_type dev_symv_type

typedef hemv_type dev_her_type

typedef hemv_type dev_her2_type

typedef hemv_type dev_syr_type

typedef hemv_type dev_syr2_type

typedef trmv_type dev_trmv_type

typedef trmv_type dev_trsv_type

typedef ger_type dev_ger_type

typedef ger_type dev_geru_type

typedef ger_type dev_gerc_type

typedef gemm_type dev_gemm_type

typedef hemm_type dev_hemm_type

typedef hemm_type dev_symm_type

typedef herk_type dev_herk_type

typedef herk_type dev_syrk_type

typedef herk_type dev_syr2k_type

typedef herk_type dev_her2k_type

typedef trmm_type dev_trmm_type

typedef trmm_type dev_trsm_type

struct axpy_type

Parameters for Level 1 BLAS operations (vector length only).

Used by: axpy, scal, copy, swap, dot, dotu, nrm2, asum, iamax, rot, rotm, rotg, rotmg.

Public Members

int64_t n: Vector length.

struct gemv_type

Parameters for gemv (general matrix-vector multiply).

Public Members

blas::Op trans: Transpose operation.

int64_t m

int64_t n: Matrix dimensions.

struct hemv_type

Parameters for Hermitian/symmetric matrix-vector operations.

Used by: hemv, symv, her, her2, syr, syr2.

Public Members

blas::Uplo uplo: Upper or lower triangle.

int64_t n: Matrix dimension.

struct trmv_type

Parameters for triangular matrix-vector operations.

Used by: trmv, trsv.

Public Members

blas::Uplo uplo: Upper or lower triangle.

blas::Op trans: Transpose operation.

blas::Diag diag: Unit or non-unit diagonal.

int64_t n: Matrix dimension.

struct ger_type

Parameters for rank-1 update operations.

Used by: ger, geru, gerc.

Public Members

int64_t m

int64_t n: Matrix dimensions.

struct gemm_type

Parameters for gemm (general matrix-matrix multiply).

Public Members

blas::Op transA

blas::Op transB: Transpose operations for A and B.

int64_t m

int64_t n

int64_t k: Matrix dimensions.

struct hemm_type

Parameters for Hermitian/symmetric matrix-matrix multiply.

Used by: hemm, symm.

Public Members

blas::Side side: Side where Hermitian/symmetric matrix appears.

blas::Uplo uplo: Upper or lower triangle.

int64_t m

int64_t n: Matrix dimensions.

struct herk_type

Parameters for Hermitian/symmetric rank-k and rank-2k updates.

Used by: herk, syrk, syr2k, her2k.

Public Members

blas::Uplo uplo: Upper or lower triangle of result.

blas::Op trans: Transpose operation.

int64_t n

int64_t k: Matrix dimensions.

struct trmm_type

Parameters for triangular matrix-matrix operations.

Used by: trmm, trsm.

Public Members

blas::Side side: Side where triangular matrix appears.

blas::Uplo uplo: Upper or lower triangle.

blas::Op transA: Transpose operation.

blas::Diag diag: Unit or non-unit diagonal.

int64_t m

int64_t n: Matrix dimensions.

struct dev_batch_gemm_type

Parameters for batch gemm on device.

Public Members

blas::Op transA

blas::Op transB: Transpose operations.

int64_t m

int64_t n

int64_t k: Matrix dimensions.

size_t batch_size: Number of matrices in batch.

struct dev_batch_hemm_type

Parameters for batch hemm on device.

Public Members

size_t batch_size: Number of matrices in batch.

FLOP Calculations

template<typename T> class Gflop

Floating-point operation counting in gigaflops.

Template class for computing FLOPs (floating-point operations) for BLAS routines. It counts multiplies and adds separately and weights each by the per-type costs in FlopTraits, so complex arithmetic is counted in real operations.

Example usage:

// For single precision real gemm
double gflops = Gflop<float>::gemm(m, n, k);

// For single precision complex gemm
double gflops_complex = Gflop<std::complex<float>>::gemm(m, n, k);

Template Parameters:: T – Scalar type (float, double, std::complex<float>, std::complex<double>)

Subclassed by lapack::Gflop< T >

Public Static Functions

static inline double asum(double n)

Giga-FLOPs for asum (sum of absolute values).

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double axpy(double n)

Giga-FLOPs for axpy (y = alpha*x + y).

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double copy(double n)

Giga-FLOPs for copy (no arithmetic operations).

Parameters:: n – [in] Vector length
Returns:: 0 (copy has no FLOPs)

static inline double iamax(double n)

Giga-FLOPs for iamax (index of maximum absolute value).

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double nrm2(double n)

Giga-FLOPs for nrm2 (Euclidean norm).

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double dot(double n)

Giga-FLOPs for dot product.

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double scal(double n)

Giga-FLOPs for scal (vector scaling).

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double swap(double n)

Giga-FLOPs for swap (no arithmetic operations).

Parameters:: n – [in] Vector length
Returns:: 0 (swap has no FLOPs)

static inline double rot(double n)

Giga-FLOPs for Givens rotation.

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double rotm(double n)

Giga-FLOPs for modified Givens rotation.

Parameters:: n – [in] Vector length
Returns:: Gigaflops

static inline double gemv(double m, double n)

Giga-FLOPs for gemv (general matrix-vector multiply).

Parameters:

m – [in] Number of rows
n – [in] Number of columns

Returns:

Gigaflops

static inline double symv(double n)

Giga-FLOPs for symv (symmetric matrix-vector multiply).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops

static inline double hemv(double n)

Giga-FLOPs for hemv (Hermitian matrix-vector multiply).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops

static inline double trmv(double n)

Giga-FLOPs for trmv (triangular matrix-vector multiply).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops

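static inline double trsv(double n)

Giga-FLOPs for trsv (triangular solve, same as trmv).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops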

static inline double her(double n)

Giga-FLOPs for her (Hermitian rank-1 update).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops

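static inline double syr(double n)

Giga-FLOPs for syr (symmetric rank-1 update, same as her).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops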

static inline double ger(double m, double n)

Giga-FLOPs for ger (general rank-1 update).

Parameters:

m – [in] Number of rows
n – [in] Number of columns

Returns:

Gigaflops

static inline double her2(double n)

Giga-FLOPs for her2 (Hermitian rank-2 update).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops

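static inline double syr2(double n)

Giga-FLOPs for syr2 (symmetric rank-2 update, same as her2).

Parameters:: n – [in] Matrix dimension
Returns:: Gigaflops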

static inline double gemm(double m, double n, double k)

Giga-FLOPs for gemm (C = alpha*op(A)*op(B) + beta*C).

Parameters:

m – [in] Number of rows of C
n – [in] Number of columns of C
k – [in] Inner dimension

Returns:

Gigaflops
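As a worked illustration of the gemm count, the sketch below assumes the conventional tally of m*n*k multiplies and m*n*k adds, each weighted by its cost in real operations. The names `gemm_gflop_sketch` and `gemm_gflop_demo` are hypothetical, and the weights are passed as plain arguments instead of coming from FlopTraits.

```cpp
#include <cassert>
#include <cmath>

// Illustrative helper, not part of BLAS++: gemm performs m*n*k multiplies
// and m*n*k adds; each is weighted by its cost in real operations.
inline double gemm_gflop_sketch(double m, double n, double k,
                                double mul_ops, double add_ops)
{
    double muls = m * n * k;
    double adds = m * n * k;
    return 1e-9 * (mul_ops * muls + add_ops * adds);
}

// Usage: a 1000 x 1000 x 1000 gemm.
inline void gemm_gflop_demo()
{
    // Real types: one real op per multiply and per add, so 2 Gflop.
    double real_gf = gemm_gflop_sketch(1000, 1000, 1000, 1, 1);
    // Complex types: 6 real ops per multiply and 2 per add, so 8 Gflop.
    double complex_gf = gemm_gflop_sketch(1000, 1000, 1000, 6, 2);
    assert(std::fabs(real_gf - 2.0) < 1e-9);
    assert(std::fabs(complex_gf - 8.0) < 1e-9);
}
```

Under these assumptions a complex gemm counts four times the real operations of a real gemm of the same shape.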

static inline double gbmm(double m, double n, double k, double kl, double ku)

Giga-FLOPs for gbmm (banded matrix-matrix multiply).

Parameters:

m – [in] Number of rows
n – [in] Number of columns
k – [in] Inner dimension
kl – [in] Lower bandwidth
ku – [in] Upper bandwidth

Returns:

Gigaflops

static inline double hemm(blas::Side side, double m, double n)

Giga-FLOPs for hemm (Hermitian matrix-matrix multiply).

Parameters:

side – [in] Side where Hermitian matrix appears
m – [in] Number of rows of C
n – [in] Number of columns of C

Returns:

Gigaflops

static inline double hbmm(double m, double n, double kd)

Giga-FLOPs for hbmm (Hermitian banded matrix-matrix multiply).

Parameters:

m – [in] Number of rows
n – [in] Number of columns
kd – [in] Bandwidth

Returns:

Gigaflops

static inline double symm(blas::Side side, double m, double n)

Giga-FLOPs for symm (symmetric matrix-matrix multiply).

Parameters:

side – [in] Side where symmetric matrix appears
m – [in] Number of rows of C
n – [in] Number of columns of C

Returns:

Gigaflops

static inline double herk(double n, double k)

Giga-FLOPs for herk (Hermitian rank-k update).

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigaflops

static inline double syrk(double n, double k)

Giga-FLOPs for syrk (symmetric rank-k update, same as herk).

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigaflops

static inline double her2k(double n, double k)

Giga-FLOPs for her2k (Hermitian rank-2k update).

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigaflops

static inline double syr2k(double n, double k)

Giga-FLOPs for syr2k (symmetric rank-2k update, same as her2k).

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigaflops

static inline double trmm(blas::Side side, double m, double n)

Giga-FLOPs for trmm (triangular matrix-matrix multiply).

Parameters:

side – [in] Side where triangular matrix appears
m – [in] Number of rows of B
n – [in] Number of columns of B

Returns:

Gigaflops
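The role of the side argument can be illustrated with a small sketch. It assumes the conventional LAPACK-style counts for a triangular multiply, where a triangular matrix of order p applied to q vectors costs q*p*(p+1)/2 multiplies and q*p*(p-1)/2 adds. `SideSketch` is a hypothetical stand-in for blas::Side, and `trmm_gflop_sketch` is not a BLAS++ function.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for blas::Side.
enum class SideSketch { Left, Right };

// Illustrative only: B is m x n; the triangular matrix A is m x m when it
// multiplies from the left and n x n when it multiplies from the right.
inline double trmm_gflop_sketch(SideSketch side, double m, double n,
                                double mul_ops, double add_ops)
{
    double p = (side == SideSketch::Left) ? m : n;  // order of triangular A
    double q = (side == SideSketch::Left) ? n : m;  // number of vectors it acts on
    double muls = 0.5 * q * p * (p + 1);
    double adds = 0.5 * q * p * (p - 1);
    return 1e-9 * (mul_ops * muls + add_ops * adds);
}

// Usage: a 4 x 3 matrix B with real arithmetic.
inline void trmm_gflop_demo()
{
    // Left:  30 multiplies + 18 adds = 48 flops.
    assert(std::fabs(trmm_gflop_sketch(SideSketch::Left, 4, 3, 1, 1) - 48e-9) < 1e-15);
    // Right: 24 multiplies + 12 adds = 36 flops.
    assert(std::fabs(trmm_gflop_sketch(SideSketch::Right, 4, 3, 1, 1) - 36e-9) < 1e-15);
}
```

For real types the total reduces to q*p*p, which is why the side matters whenever B is not square.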

static inline double trsm(blas::Side side, double m, double n)

Giga-FLOPs for trsm (triangular solve, same as trmm).

Parameters:

side – [in] Side where triangular matrix appears
m – [in] Number of rows of B
n – [in] Number of columns of B

Returns:

Gigaflops

Public Static Attributes

static double mul_ops = FlopTraits<T>::mul_ops: Number of real ops per multiply for type T.

static double add_ops = FlopTraits<T>::add_ops: Number of real ops per add for type T.

template<typename T> class FlopTraits

Traits for counting operations per multiply and add.

For real types, one multiply = 1 op, one add = 1 op. For complex types, one complex multiply = 6 real ops (4 muls + 2 adds), one complex add = 2 real ops.

Template Parameters:: T – Scalar type

Public Static Attributes

static double mul_ops = 1: Number of real operations for one multiply.

static double add_ops = 1: Number of real operations for one add.
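The real and complex costs described above map naturally onto a primary template plus a partial specialization. The sketch below shows that shape under the name `FlopTraitsSketch`, which is hypothetical; the values 1, 1, 6, and 2 are the ones stated in the description, and whether BLAS++ declares its members exactly this way is an assumption.

```cpp
#include <cassert>
#include <complex>

// Primary template: real scalar types cost one real op per multiply or add.
template <typename T>
struct FlopTraitsSketch {
    static constexpr double mul_ops = 1;
    static constexpr double add_ops = 1;
};

// Partial specialization: a complex multiply is 4 real multiplies plus
// 2 real adds (6 ops); a complex add is 2 real adds.
template <typename T>
struct FlopTraitsSketch<std::complex<T>> {
    static constexpr double mul_ops = 6;
    static constexpr double add_ops = 2;
};

// Compile-time checks of the values selected for each scalar type.
static_assert(FlopTraitsSketch<float>::mul_ops == 1, "real multiply is 1 op");
static_assert(FlopTraitsSketch<double>::add_ops == 1, "real add is 1 op");
static_assert(FlopTraitsSketch<std::complex<float>>::mul_ops == 6,
              "complex multiply is 6 real ops");
static_assert(FlopTraitsSketch<std::complex<double>>::add_ops == 2,
              "complex add is 2 real ops");
```

Because the selection happens at compile time, a class templated on T can read these constants with no runtime branching.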

Bandwidth Calculations

template<typename T> class Gbyte

Data transfer counting in gigabytes.

Template class for computing data transfer (in gigabytes) for BLAS operations. Accounts for reading and writing matrices/vectors based on operation semantics.

Example usage:

double gb = Gbyte<float>::gemm(m, n, k);
double gb_complex = Gbyte<std::complex<float>>::gemm(m, n, k);

Template Parameters:: T – Scalar type (e.g., float, double, std::complex<float>)

Subclassed by lapack::Gbyte< T >

Public Static Functions

static inline double asum(double n)

Data transfer for asum (sum of absolute values).

Reads vector x.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double axpy(double n)

Data transfer for axpy (y = alpha*x + y).

Reads x and y, writes y.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred
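The read/write accounting for axpy can be turned into a formula directly: reading x and y and writing y moves 3*n elements of type T. The sketch below assumes exactly that model and ignores the scalar alpha; `axpy_gbyte_sketch` and `axpy_gbyte_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Illustrative only: axpy reads x and y and writes y, so 3*n elements move.
template <typename T>
double axpy_gbyte_sketch(double n)
{
    return 1e-9 * (3 * n * sizeof(T));
}

// Usage: vectors of one million elements.
inline void axpy_gbyte_demo()
{
    // double: 3e6 elements * 8 bytes = 0.024 GB.
    assert(std::fabs(axpy_gbyte_sketch<double>(1e6) - 0.024) < 1e-12);
    // std::complex<double>: 3e6 elements * 16 bytes = 0.048 GB.
    assert(std::fabs(axpy_gbyte_sketch<std::complex<double>>(1e6) - 0.048) < 1e-12);
}
```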

static inline double copy(double n)

Data transfer for copy (y = x).

Reads x, writes y.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double iamax(double n)

Data transfer for iamax (index of max absolute value).

Reads vector x.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double nrm2(double n)

Data transfer for nrm2 (Euclidean norm).

Reads vector x.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double dot(double n)

Data transfer for dot product.

Reads vectors x and y.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double scal(double n)

Data transfer for scal (x = alpha*x).

Reads and writes vector x.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double swap(double n)

Data transfer for swap (exchange x and y).

Reads and writes vectors x and y.

Parameters:: n – [in] Vector length
Returns:: Gigabytes transferred

static inline double gemv(double m, double n)

Data transfer for gemv (y = alpha*A*x + beta*y).

Reads matrix A, vectors x and y, writes y.

Parameters:

m – [in] Number of rows
n – [in] Number of columns

Returns:

Gigabytes transferred

static inline double hemv(double n)

Data transfer for hemv (Hermitian matrix-vector multiply).

Reads Hermitian matrix A (triangle), vector x, writes y.

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double symv(double n)

Data transfer for symv (same as hemv for symmetric).

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double trmv(double n)

Data transfer for trmv/trsv (triangular matrix-vector ops).

Reads triangular matrix A, vector x, writes x.

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double trsv(double n)

Data transfer for trsv (same as trmv).

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double ger(double m, double n)

Data transfer for ger (rank-1 update A = A + alpha*x*y^T).

Reads A, x, y, writes A.

Parameters:

m – [in] Number of rows
n – [in] Number of columns

Returns:

Gigabytes transferred

static inline double her(double n)

Data transfer for her/syr (Hermitian/symmetric rank-1 update).

Reads triangular A, vector x, writes triangular A.

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double syr(double n)

Data transfer for syr (same as her for symmetric).

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double her2(double n)

Data transfer for her2/syr2 (Hermitian/symmetric rank-2 update).

Reads triangular A, vectors x and y, writes triangular A.

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double syr2(double n)

Data transfer for syr2 (same as her2 for symmetric).

Parameters:: n – [in] Matrix dimension
Returns:: Gigabytes transferred

static inline double copy_2d(double m, double n)

Data transfer for 2D matrix copy.

Reads matrix A, writes matrix B.

Parameters:

m – [in] Number of rows
n – [in] Number of columns

Returns:

Gigabytes transferred

static inline double gemm(double m, double n, double k)

Data transfer for gemm (C = alpha*A*B + beta*C).

Reads A, B, C, writes C.

Parameters:

m – [in] Number of rows of C
n – [in] Number of columns of C
k – [in] Inner dimension

Returns:

Gigabytes transferred
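The gemm traffic model follows from the shapes involved: A is m x k, B is k x n, and C is m x n and is both read and written. The sketch below assumes that each matrix moves exactly once. The names are hypothetical, and the flops-per-byte helper is one possible way to combine a flop count with a byte count, not something this class provides.

```cpp
#include <cassert>
#include <cmath>

// Illustrative only: A is m x k, B is k x n, and C (m x n) is read and written.
template <typename T>
double gemm_gbyte_sketch(double m, double n, double k)
{
    return 1e-9 * ((m * k + k * n + 2 * m * n) * sizeof(T));
}

// Hypothetical helper for real scalar types only: it assumes 2*m*n*k real
// flops and divides by the bytes moved.
template <typename T>
double gemm_flops_per_byte_sketch(double m, double n, double k)
{
    double gflop = 1e-9 * (2 * m * n * k);
    return gflop / gemm_gbyte_sketch<T>(m, n, k);
}

// Usage: a 1000 x 1000 x 1000 gemm in single precision.
inline void gemm_gbyte_demo()
{
    // 4e6 elements * 4 bytes = 0.016 GB moved for 2 Gflop of work.
    assert(std::fabs(gemm_gbyte_sketch<float>(1000, 1000, 1000) - 0.016) < 1e-12);
    assert(std::fabs(gemm_flops_per_byte_sketch<float>(1000, 1000, 1000) - 125.0) < 1e-6);
}
```

Under this model the ratio grows with the matrix size, whereas for Level 1 routines such as axpy it stays constant.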

static inline double hemm(blas::Side side, double m, double n)

Data transfer for hemm (Hermitian matrix-matrix multiply).

Reads Hermitian A, matrices B and C, writes C.

Parameters:

side – [in] Side where Hermitian matrix appears
m – [in] Number of rows of C
n – [in] Number of columns of C

Returns:

Gigabytes transferred

static inline double symm(blas::Side side, double m, double n)

Data transfer for symm (same as hemm for symmetric).

Parameters:

side – [in] Side where symmetric matrix appears
m – [in] Number of rows of C
n – [in] Number of columns of C

Returns:

Gigabytes transferred

static inline double herk(double n, double k)

Data transfer for herk (Hermitian rank-k update).

Reads matrix A, Hermitian C, writes C.

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigabytes transferred

static inline double syrk(double n, double k)

Data transfer for syrk (same as herk for symmetric).

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigabytes transferred

static inline double her2k(double n, double k)

Data transfer for her2k (Hermitian rank-2k update).

Reads matrices A and B, Hermitian C, writes C.

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigabytes transferred

static inline double syr2k(double n, double k)

Data transfer for syr2k (same as her2k for symmetric).

Parameters:

n – [in] Dimension of C
k – [in] Inner dimension

Returns:

Gigabytes transferred

static inline double trmm(blas::Side side, double m, double n)

Data transfer for trmm/trsm (triangular matrix-matrix ops).

Reads triangular A, matrix B, writes B.

Parameters:

side – [in] Side where triangular matrix appears
m – [in] Number of rows of B
n – [in] Number of columns of B

Returns:

Gigabytes transferred

static inline double trsm(blas::Side side, double m, double n)

Data transfer for trsm (same as trmm).

Parameters:

side – [in] Side where triangular matrix appears
m – [in] Number of rows of B
n – [in] Number of columns of B

Returns:

Gigabytes transferred