Example 01: Matrix Construction

This example demonstrates the fundamental operations for creating and managing SLATE matrices.

Key Concepts

  1. Constructors: Creating empty matrices with specific dimensions and distribution.

  2. Tile Insertion: Allocating memory for tiles on Host (CPU) or Devices (GPU).

  3. User Data: Wrapping existing data pointers (e.g., from ScaLAPACK) into a SLATE matrix.

  4. Transposition: Creating transposed views of matrices without copying data.

  5. Element Access: Iterating over tiles and elements efficiently.

C++ Example

Matrix Construction (Lines 18-44)

// Create an empty matrix (2D block cyclic layout, p x q grid,
// no tiles allocated, square nb x nb tiles)
slate::Matrix<scalar_type>
    A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );

Here, we construct a standard general matrix A. Note that:

  • m and n are the global dimensions (rows, columns).

  • nb is the block size (tile size).

  • grid_p and grid_q define the MPI process grid dimensions.

  • MPI_COMM_WORLD is the communicator.

  • Important: This constructor defines the matrix structure and distribution, but does not allocate memory for the matrix elements (tiles). The matrix is initially empty.

We can also create matrices with rectangular tiles (specifying mb and nb) or triangular matrices (slate::TriangularMatrix). A.emptyLike() (line 42) is a convenient way to create a new matrix A2 with the same structure and distribution as A but without any data allocated.
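For the sizes used in this example (m = 2000, n = 1000, nb = 256), A.mt() and A.nt() report 8 tile rows and 4 tile columns, since the tile grid follows from ceiling division. A minimal sketch of that arithmetic (the helper name is ours, not part of SLATE):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of SLATE: number of nb-sized tiles
// needed to cover a dimension of n elements (ceiling division).
int64_t num_tiles( int64_t n, int64_t nb )
{
    return (n + nb - 1) / nb;
}
```

So a 2000 x 1000 matrix with 256 x 256 tiles has an 8 x 4 grid of tiles, with the last tile in each dimension possibly smaller than nb.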

Allocating Memory (Lines 47-68)

// Insert tiles on the CPU host.
A.insertLocalTiles( slate::Target::Host );

Since the constructor creates an empty shell, we must explicitly allocate memory for the tiles that belong to this MPI rank. insertLocalTiles is the standard way to do this.

  • Target::Host allocates memory on the CPU.

  • Target::Devices (seen in lines 71-92) allocates memory on GPU devices if available.

The loop shown in lines 63-66 demonstrates what insertLocalTiles effectively does under the hood: it iterates over the matrix structure, checks if a tile is local to the current rank, and if so, inserts (allocates) it.
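The ownership test behind tileIsLocal can be sketched with the standard 2D block-cyclic owner formula (a toy stand-in, assuming SLATE's default column-major grid order; not SLATE's actual implementation):

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for A.tileIsLocal( i, j ): in a column-major p x q
// block-cyclic distribution, tile (i, j) is owned by MPI rank
// (i mod p) + (j mod q) * p.
bool tile_is_local( int64_t i, int64_t j, int p, int q, int rank )
{
    int owner = int( i % p ) + int( j % q ) * p;
    return owner == rank;
}
```

On a 2 x 2 grid, for example, rank 0 owns tiles with even i and even j, while rank 3 owns tiles with odd i and odd j.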

Wrapping Existing Data (Lines 97-132)

// Attach user allocated tiles, from pointers in data( i, j )
// with local stride lld between columns.
for (int64_t j = 0; j < A.nt(); ++j) {
    for (int64_t i = 0; i < A.mt(); ++i) {
        if (A.tileIsLocal( i, j ))
            A.tileInsert( i, j, data( i, j ), lld );
    }
}

If you already have data allocated (e.g., from a legacy application), you can wrap it into a SLATE matrix without copying. The data lambda in this example computes the pointer to the start of each tile in a ScaLAPACK-style array. tileInsert is then called with that pointer and the local leading dimension (lld).
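The tile-offset arithmetic that the lambda delegates to slate::global2local can be sketched as follows (our simplified version, valid only for tile-aligned offsets in a block-cyclic layout):

```cpp
#include <cassert>
#include <cstdint>

// Simplified sketch of the global-to-local mapping for tile-aligned
// offsets: a process row owns every p-th global tile row (cyclically),
// so global tile index i maps to local element offset (i / p) * nb.
int64_t global_to_local( int64_t i_global_tile, int64_t nb, int p )
{
    return (i_global_tile / p) * nb;
}
```

With nb = 256 and p = 2, a rank in process row 0 owns global tile rows 0, 2, 4, ..., stored at local offsets 0, 256, 512, ...; the tile pointer is then &A_data[ row_offset + col_offset*lld ].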

ScaLAPACK Conversion (Lines 135-162)

auto A = slate::Matrix<scalar_type>::fromScaLAPACK( ... );

This is the simplified, preferred way to wrap legacy ScaLAPACK data. It handles all the index calculations and tile insertions automatically.

Transposition (Lines 165-185)

auto AT = transpose( A );
auto AH = conj_transpose( A );

Transposition in SLATE is a metadata operation: AT and AH are views of A, so no data is copied or moved.

  • AT has its operation flag set to Op::Trans.

  • Accessing AT(i, j) effectively accesses A(j, i) with transposition applied on the fly.

  • This allows transposed arguments to be passed to BLAS-like routines (such as gemm) without copying.
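The view mechanics can be illustrated with a toy (non-SLATE) class that keeps a reference to the original storage and simply swaps indices on access:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy illustration (not SLATE code): a transposed "view" holds a
// reference to the original matrix and swaps indices on access.
struct ToyMatrix {
    int64_t rows, cols;
    std::vector<double> data;  // column-major storage
    double& operator()( int64_t i, int64_t j ) { return data[ i + j*rows ]; }
};

struct ToyTransView {
    ToyMatrix& A;
    double& operator()( int64_t i, int64_t j ) { return A( j, i ); }
};

// Writing through the view must alias the original storage.
bool view_aliases_storage()
{
    ToyMatrix A{ 2, 3, std::vector<double>( 6, 0.0 ) };
    ToyTransView AT{ A };
    AT( 2, 0 ) = 5.0;   // same element as A( 0, 2 )
    return A( 0, 2 ) == 5.0;
}
```

SLATE's views carry the op flag through to the tile level, so a kernel receiving AT(i, j) sees a tile marked Op::Trans and handles it without any data movement.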

Element Access (Lines 189-254)

if (A.tileIsLocal( i, j )) {
    // ...
    slate::Tile<scalar_type> T = A( i, j, slate::HostNum );
    for (int64_t jj = 0; jj < T.nb(); ++jj) {
        for (int64_t ii = 0; ii < T.mb(); ++ii) {
            T.at( ii, jj ) = ...
        }
    }
}

Direct element access in distributed memory requires care:

  1. We iterate over global tiles (i, j).

  2. We check A.tileIsLocal(i, j) to ensure the current rank owns the data.

  3. We acquire the tile T on the host.

  4. We iterate within the tile using local indices ii, jj.

  5. T.at(ii, jj) provides safe access (with bounds checking in debug mode), while T.data() (shown in elements2) provides raw pointer access for maximum performance in inner loops.
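The raw-pointer variant relies on column-major indexing: element (ii, jj) of a tile with column stride stride lives at data[ ii + jj*stride ]. A small sketch of that arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Column-major offset used by the T.data() access pattern:
// element (ii, jj) of a tile whose column stride is `stride`.
int64_t tile_offset( int64_t ii, int64_t jj, int64_t stride )
{
    return ii + jj*stride;
}
```

Note that stride is T.stride(), not T.mb(): for tiles wrapping user data (e.g., ScaLAPACK arrays), the column stride is the local leading dimension lld, which can exceed the tile height.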

  1// ex01_matrix.cc
  2// create 2000 x 1000 matrix on 2 x 2 MPI process grid
  3
  4/// !!!   Lines between `//---------- begin label`          !!!
  5/// !!!             and `//---------- end label`            !!!
  6/// !!!   are included in the SLATE Users' Guide.           !!!
  7
  8#include <slate/slate.hh>
  9
 10#include "util.hh"
 11
 12int mpi_size = 0;
 13int mpi_rank = 0;
 14int grid_p = 0;
 15int grid_q = 0;
 16
 17//------------------------------------------------------------------------------
 18template <typename scalar_type>
 19void test_constructor()
 20{
 21    print_func( mpi_rank );
 22
 23    int64_t m=2000, n=1000, mb=128, nb=256;
 24
 25    //---------- begin constructor
 26    // Create an empty matrix (2D block cyclic layout, p x q grid,
 27    // no tiles allocated, square nb x nb tiles)
 28    slate::Matrix<scalar_type>
 29        A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
 30
 31    // Create an empty matrix (2D block cyclic layout, p x q grid,
 32    // no tiles allocated, rectangular mb x nb tiles)
 33    slate::Matrix<scalar_type>
 34        B( m, n, mb, nb, grid_p, grid_q, MPI_COMM_WORLD );
 35
 36    // Create an empty TriangularMatrix (2D block cyclic layout, no tiles)
 37    slate::TriangularMatrix<scalar_type>
 38        T( slate::Uplo::Lower, slate::Diag::NonUnit, n, nb,
 39           grid_p, grid_q, MPI_COMM_WORLD );
 40
 41    // Create an empty matrix based on another matrix structure.
 42    slate::Matrix<scalar_type> A2 = A.emptyLike();
 43    //---------- end constructor
 44}
 45
 46//------------------------------------------------------------------------------
 47template <typename scalar_type>
 48void test_insert_host()
 49{
 50    print_func( mpi_rank );
 51
 52    int64_t m=2000, n=1000, nb=256;
 53
 54    //---------- begin insert_host
 55    // Create two empty matrices.
 56    slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
 57    auto A2 = A.emptyLike();
 58
 59    // Insert tiles on the CPU host.
 60    A.insertLocalTiles( slate::Target::Host );
 61
 62    // A2.insertLocalTiles( slate::Target::Host ) is equivalent to:
 63    for (int64_t j = 0; j < A2.nt(); ++j)
 64        for (int64_t i = 0; i < A2.mt(); ++i)
 65            if (A2.tileIsLocal( i, j ))
 66                A2.tileInsert( i, j, slate::HostNum );
 67    //---------- end insert_host
 68}
 69
 70//------------------------------------------------------------------------------
 71template <typename scalar_type>
 72void test_insert_device()
 73{
 74    print_func( mpi_rank );
 75
 76    int64_t m=2000, n=1000, nb=256;
 77
 78    //---------- begin insert_device
 79    // Create two empty matrices.
 80    slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
 81    auto A2 = A.emptyLike();
 82
 83    // Insert tiles on the GPU devices.
 84    A.insertLocalTiles( slate::Target::Devices );
 85
 86    // A2.insertLocalTiles( slate::Target::Devices ) is equivalent to:
 87    for (int64_t j = 0; j < A2.nt(); ++j)
 88        for (int64_t i = 0; i < A2.mt(); ++i)
 89            if (A2.tileIsLocal( i, j ))
 90                A2.tileInsert( i, j, A2.tileDevice( i, j ) );
 91    //---------- end insert_device
 92}
 93
 94//------------------------------------------------------------------------------
 95// This example uses ScaLAPACK data just as an example of user-defined data,
 96// but it is easier to use fromScaLAPACK to create a SLATE matrix; see below.
 97template <typename scalar_type>
 98void test_insert_user()
 99{
100    print_func( mpi_rank );
101
102    int64_t m=2000, n=1000, nb=256;
103
104    // User-allocated data, in ScaLAPACK format (assuming column-major grid).
105    int myrow = mpi_rank % grid_p;
106    int mycol = mpi_rank / grid_p;
107    int64_t mlocal = slate::num_local_rows_cols( m, nb, myrow, 0, grid_p );
108    int64_t nlocal = slate::num_local_rows_cols( n, nb, mycol, 0, grid_q );
109    int64_t lld = mlocal; // local leading dimension
110    scalar_type* A_data = new scalar_type[ lld*nlocal ];
111
112    // lambda to get tile (i, j) in ScaLAPACK data.
113    auto data = [A_data, nb, lld]( int64_t i, int64_t j ) {
114        int64_t ii_local = slate::global2local( i*nb, nb, grid_p );
115        int64_t jj_local = slate::global2local( j*nb, nb, grid_q );
116        return &A_data[ ii_local + jj_local*lld ];
117    };
118
119    //---------- begin insert_user
120    // Create an empty matrix (2D block cyclic layout, no tiles).
121    slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
122
123    // Attach user allocated tiles, from pointers in data( i, j )
124    // with local stride lld between columns.
125    for (int64_t j = 0; j < A.nt(); ++j) {
126        for (int64_t i = 0; i < A.mt(); ++i) {
127            if (A.tileIsLocal( i, j ))
128                A.tileInsert( i, j, data( i, j ), lld );
129        }
130    }
131    //---------- end insert_user
132}
133
134//------------------------------------------------------------------------------
135template <typename scalar_type>
136void test_fromScaLAPACK()
137{
138    print_func( mpi_rank );
139
140    int64_t m=2000, n=1000, nb=256;
141
142    //---------- begin fromScaLAPACK
143    // User-allocated data, in ScaLAPACK format (assuming column-major grid).
144    int myrow = mpi_rank % grid_p;
145    int mycol = mpi_rank / grid_p;
146    int64_t mlocal = slate::num_local_rows_cols( m, nb, myrow, 0, grid_p );
147    int64_t nlocal = slate::num_local_rows_cols( n, nb, mycol, 0, grid_q );
148    int64_t lld = mlocal; // local leading dimension
149    scalar_type* A_data = new scalar_type[ lld*nlocal ];
150
151    // Create matrix from ScaLAPACK data.
152    auto A = slate::Matrix<scalar_type>::fromScaLAPACK(
153        m, n,                   // global matrix dimensions
154        A_data,                 // local ScaLAPACK array data
155        lld,                    // local leading dimension (column stride) for data
156        nb, nb,                 // block size
157        slate::GridOrder::Col,  // col- or row-major MPI process grid
158        grid_p, grid_q,         // MPI process grid
159        MPI_COMM_WORLD          // MPI communicator
160    );
161    //---------- end fromScaLAPACK
162}
163
164//------------------------------------------------------------------------------
165template <typename scalar_type>
166void test_transpose()
167{
168    print_func( mpi_rank );
169
170    int64_t m=2000, n=1000, nb=256;
171
172    //---------- begin transpose
173    slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
174
175    // Transpose
176    // AT is a transposed view of A, with flag AT.op() == Op::Trans.
177    // The Tile AT( i, j ) == transpose( A( j, i ) ).
178    auto AT = transpose( A );
179
180    // Conjugate transpose
181    // AH is a conjugate-transposed view of A, with flag AH.op() == Op::ConjTrans.
182    // The Tile AH( i, j ) == conj_transpose( A( j, i ) ).
183    auto AH = conj_transpose( A );
184    //---------- end transpose
185}
186
187//------------------------------------------------------------------------------
188template <typename scalar_type>
189void test_elements()
190{
191    using slate::LayoutConvert;
192
193    print_func( mpi_rank );
194
195    int64_t m=2000, n=1000, nb=256;
196
197    //---------- begin elements
198    slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
199    A.insertLocalTiles( slate::Target::Host );
200
201    // Loop over tiles in A.
202    int64_t jj_global = 0;
203    for (int64_t j = 0; j < A.nt(); ++j) {
204        int64_t ii_global = 0;
205        for (int64_t i = 0; i < A.mt(); ++i) {
206            if (A.tileIsLocal( i, j )) {
207                // For local tiles, loop over entries in tile.
208                // Make sure CPU tile exists for writing.
209                A.tileGetForWriting( i, j, slate::HostNum, LayoutConvert::ColMajor );
210                slate::Tile<scalar_type> T = A( i, j, slate::HostNum );
211                for (int64_t jj = 0; jj < T.nb(); ++jj) {
212                    for (int64_t ii = 0; ii < T.mb(); ++ii) {
213                        // Note: currently using T.at() is inefficient
214                        // in inner loops; see below.
215                        T.at( ii, jj )
216                            = std::abs( (ii_global + ii) - (jj_global + jj) );
217                    }
218                }
219            }
220            ii_global += A.tileMb( i );
221        }
222        jj_global += A.tileNb( j );
223    }
224    //---------- end elements
225
226    //---------- begin elements2
227    // Loop over tiles in A, more efficient implementation.
228    jj_global = 0;
229    for (int64_t j = 0; j < A.nt(); ++j) {
230        int64_t ii_global = 0;
231        for (int64_t i = 0; i < A.mt(); ++i) {
232            if (A.tileIsLocal( i, j )) {
233                // For local tiles, loop over entries in tile.
234                // Make sure CPU tile exists for writing.
235                A.tileGetForWriting( i, j, slate::HostNum, LayoutConvert::ColMajor );
236                slate::Tile<scalar_type> T = A( i, j, slate::HostNum );
237                scalar_type* data = T.data();
238                int64_t   mb      = T.mb();
239                int64_t   nb      = T.nb();
240                int64_t   stride  = T.stride();
241                for (int64_t jj = 0; jj < T.nb(); ++jj) {
242                    for (int64_t ii = 0; ii < T.mb(); ++ii) {
243                        // Currently more efficient than using T.at().
244                        data[ ii + jj*stride ]
245                            = std::abs( (ii_global + ii) - (jj_global + jj) );
246                    }
247                }
248            }
249            ii_global += A.tileMb( i );
250        }
251        jj_global += A.tileNb( j );
252    }
253    //---------- end elements2
254}
255
256//------------------------------------------------------------------------------
257// Map double => float, and complex<double> => complex
258template <typename T>
259struct mixed_precision_traits;
260
261template <>
262struct mixed_precision_traits< double >
263{
264    using low_type = float;
265};
266
267template <typename T>
268struct mixed_precision_traits< std::complex<T> >
269{
270    using real_low_type = typename mixed_precision_traits<T>::low_type;
271    using low_type = std::complex< real_low_type >;
272};
273
274//------------------------------------------------------------------------------
275template <typename scalar_type>
276void test_copy()
277{
278    // low_type is float or complex<float>.
279    using low_type = typename mixed_precision_traits< scalar_type >::low_type;
280
281    print_func( mpi_rank );
282
283    int64_t m=2000, n=1000, nb=256;
284
285    //---------- begin copy
286    // scalar_type is double or complex<double>;
287    // low_type    is float  or complex<float>.
288    slate::Matrix<scalar_type> A_hi( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
289    slate::Matrix<low_type>    A_lo( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
290    A_hi.insertLocalTiles();
291    A_lo.insertLocalTiles();
292
293    auto A_hi_2 = A_hi.emptyLike();
294    A_hi_2.insertLocalTiles();
295
296    // Copy with precision conversion from double to float.
297    copy( A_hi, A_lo );
298
299    // Copy with precision conversion from float to double.
300    copy( A_lo, A_hi );
301
302    // Copy without conversion.
303    copy( A_hi, A_hi_2 );
304    //---------- end copy
305}
306
307//------------------------------------------------------------------------------
308template <typename scalar_type>
309void test_all()
310{
311    test_constructor   <scalar_type>();
312    test_insert_host   <scalar_type>();
313    test_insert_device <scalar_type>();
314    test_insert_user   <scalar_type>();
315    test_fromScaLAPACK <scalar_type>();
316    test_elements      <scalar_type>();
317}
318
319//------------------------------------------------------------------------------
320int main( int argc, char** argv )
321{
322    try {
323        // Parse command line to set types for s, d, c, z precisions.
324        bool types[ 4 ];
325        parse_args( argc, argv, types );
326
327        int provided = 0;
328        slate_mpi_call(
329            MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided ) );
330        assert( provided == MPI_THREAD_MULTIPLE );
331
332        slate_mpi_call(
333            MPI_Comm_size( MPI_COMM_WORLD, &mpi_size ) );
334
335        slate_mpi_call(
336            MPI_Comm_rank( MPI_COMM_WORLD, &mpi_rank ) );
337
338        // Determine p-by-q grid for this MPI size.
339        grid_size( mpi_size, &grid_p, &grid_q );
340        if (mpi_rank == 0) {
341            printf( "mpi_size %d, grid_p %d, grid_q %d\n",
342                    mpi_size, grid_p, grid_q );
343        }
344
345        // so random_matrix is different on different ranks.
346        srand( 100 * mpi_rank );
347
348        if (types[ 0 ]) {
349            test_all< float >();
350        }
351
352        if (types[ 1 ]) {
353            test_all< double >();
354            test_copy< double >();
355        }
356
357        if (types[ 2 ]) {
358            test_all< std::complex<float> >();
359        }
360
361        if (types[ 3 ]) {
362            test_all< std::complex<double> >();
363            test_copy< std::complex<double> >();
364        }
365
366        slate_mpi_call(
367            MPI_Finalize() );
368    }
369    catch (std::exception const& ex) {
370        fprintf( stderr, "%s", ex.what() );
371        return 1;
372    }
373    return 0;
374}