Example 01: Matrix Construction
This example demonstrates the fundamental operations for creating and managing SLATE matrices.
Key Concepts
Constructors: Creating empty matrices with specific dimensions and distribution.
Tile Insertion: Allocating memory for tiles on Host (CPU) or Devices (GPU).
User Data: Wrapping existing data pointers (e.g., from ScaLAPACK) into a SLATE matrix.
Transposition: Creating transposed views of matrices without copying data.
Element Access: Iterating over tiles and elements efficiently.
C++ Example
Matrix Construction (Lines 18-44)
// Create an empty matrix (2D block cyclic layout, p x q grid,
// no tiles allocated, square nb x nb tiles)
slate::Matrix<scalar_type>
A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
Here, we construct a standard general matrix A. Note that:
m and n are the global dimensions (rows, columns).
nb is the block size (tile size).
grid_p and grid_q define the MPI process grid dimensions.
MPI_COMM_WORLD is the communicator.
Important: This constructor defines the matrix structure and distribution, but does not allocate memory for the matrix elements (tiles). The matrix is initially empty.
We can also create matrices with rectangular tiles (specifying mb and nb) or triangular matrices (slate::TriangularMatrix). A.emptyLike() (line 42) is a convenient way to create a new matrix A2 with the same structure and distribution as A but without any data allocated.
Allocating Memory (Lines 47-68)
// Insert tiles on the CPU host.
A.insertLocalTiles( slate::Target::Host );
Since the constructor creates an empty shell, we must explicitly allocate memory for the tiles that belong to this MPI rank. insertLocalTiles is the standard way to do this.
Target::Host allocates memory on the CPU.
Target::Devices (seen in lines 71-92) allocates memory on GPU devices if available.
The loop shown in lines 63-66 demonstrates what insertLocalTiles effectively does under the hood: it iterates over the matrix structure, checks if a tile is local to the current rank, and if so, inserts (allocates) it.
Wrapping Existing Data (Lines 97-132)
// Attach user allocated tiles, from pointers in data( i, j )
// with local stride lld between columns.
for (int64_t j = 0; j < A.nt(); ++j) {
for (int64_t i = 0; i < A.mt(); ++i) {
if (A.tileIsLocal( i, j ))
A.tileInsert( i, j, data( i, j ), lld );
}
}
If you already have data allocated (e.g., from a legacy application), you can wrap it into a SLATE matrix without copying. The data lambda in this example computes the pointer to the start of each tile within a ScaLAPACK-style local array. tileInsert is then called with that pointer and the local leading dimension (lld).
ScaLAPACK Conversion (Lines 135-162)
auto A = slate::Matrix<scalar_type>::fromScaLAPACK( ... );
This is the simplified, preferred way to wrap legacy ScaLAPACK data. It handles all the index calculations and tile insertions automatically.
Transposition (Lines 165-185)
auto AT = transpose( A );
auto AH = conj_transpose( A );
SLATE operations like transposition are typically metadata operations. AT is a view of A. No data is copied or moved.
AT has its operation flag set to Op::Trans.
Accessing AT(i, j) effectively accesses A(j, i) with transposition applied on the fly.
This allows efficient passing of transposed arguments to BLAS routines (like gemm) without overhead.
Element Access (Lines 189-254)
if (A.tileIsLocal( i, j )) {
// ...
slate::Tile<scalar_type> T = A( i, j, slate::HostNum );
for (int64_t jj = 0; jj < T.nb(); ++jj) {
for (int64_t ii = 0; ii < T.mb(); ++ii) {
T.at( ii, jj ) = ...
}
}
}
Direct element access in distributed memory requires care:
We iterate over global tiles (i, j).
We check A.tileIsLocal(i, j) to ensure the current rank owns the data.
We acquire the tile T on the host.
We iterate within the tile using local indices ii, jj.
T.at(ii, jj) provides safe access (with bounds checking in debug mode), while T.data() (shown in elements2) provides raw pointer access for maximum performance in inner loops.
  1  // ex01_matrix.cc
  2  // create 2000 x 1000 matrix on 2 x 2 MPI process grid
  3
  4  /// !!! Lines between `//---------- begin label` !!!
  5  /// !!! and `//---------- end label` !!!
  6  /// !!! are included in the SLATE Users' Guide. !!!
  7
  8  #include <slate/slate.hh>
  9
 10  #include "util.hh"
 11
 12  int mpi_size = 0;
 13  int mpi_rank = 0;
 14  int grid_p = 0;
 15  int grid_q = 0;
 16
 17  //------------------------------------------------------------------------------
 18  template <typename scalar_type>
 19  void test_constructor()
 20  {
 21      print_func( mpi_rank );
 22
 23      int64_t m=2000, n=1000, mb=128, nb=256;
 24
 25      //---------- begin constructor
 26      // Create an empty matrix (2D block cyclic layout, p x q grid,
 27      // no tiles allocated, square nb x nb tiles)
 28      slate::Matrix<scalar_type>
 29          A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
 30
 31      // Create an empty matrix (2D block cyclic layout, p x q grid,
 32      // no tiles allocated, rectangular mb x nb tiles)
 33      slate::Matrix<scalar_type>
 34          B( m, n, mb, nb, grid_p, grid_q, MPI_COMM_WORLD );
 35
 36      // Create an empty TriangularMatrix (2D block cyclic layout, no tiles)
 37      slate::TriangularMatrix<scalar_type>
 38          T( slate::Uplo::Lower, slate::Diag::NonUnit, n, nb,
 39             grid_p, grid_q, MPI_COMM_WORLD );
 40
 41      // Create an empty matrix based on another matrix structure.
 42      slate::Matrix<scalar_type> A2 = A.emptyLike();
 43      //---------- end constructor
 44  }
 45
 46  //------------------------------------------------------------------------------
 47  template <typename scalar_type>
 48  void test_insert_host()
 49  {
 50      print_func( mpi_rank );
 51
 52      int64_t m=2000, n=1000, nb=256;
 53
 54      //---------- begin insert_host
 55      // Create two empty matrices.
 56      slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
 57      auto A2 = A.emptyLike();
 58
 59      // Insert tiles on the CPU host.
 60      A.insertLocalTiles( slate::Target::Host );
 61
 62      // A2.insertLocalTiles( slate::Target::Host ) is equivalent to:
 63      for (int64_t j = 0; j < A2.nt(); ++j)
 64          for (int64_t i = 0; i < A2.mt(); ++i)
 65              if (A2.tileIsLocal( i, j ))
 66                  A2.tileInsert( i, j, slate::HostNum );
 67      //---------- end insert_host
 68  }
 69
 70  //------------------------------------------------------------------------------
 71  template <typename scalar_type>
 72  void test_insert_device()
 73  {
 74      print_func( mpi_rank );
 75
 76      int64_t m=2000, n=1000, nb=256;
 77
 78      //---------- begin insert_device
 79      // Create two empty matrices.
 80      slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
 81      auto A2 = A.emptyLike();
 82
 83      // Insert tiles on the GPU devices.
 84      A.insertLocalTiles( slate::Target::Devices );
 85
 86      // A2.insertLocalTiles( slate::Target::Devices ) is equivalent to:
 87      for (int64_t j = 0; j < A2.nt(); ++j)
 88          for (int64_t i = 0; i < A2.mt(); ++i)
 89              if (A2.tileIsLocal( i, j ))
 90                  A2.tileInsert( i, j, A2.tileDevice( i, j ) );
 91      //---------- end insert_device
 92  }
 93
 94  //------------------------------------------------------------------------------
 95  // This example uses ScaLAPACK data just as an example of user-defined data,
 96  // but it is easier to use fromScaLAPACK to create a SLATE matrix; see below.
 97  template <typename scalar_type>
 98  void test_insert_user()
 99  {
100      print_func( mpi_rank );
101
102      int64_t m=2000, n=1000, nb=256;
103
104      // User-allocated data, in ScaLAPACK format (assuming column-major grid).
105      int myrow = mpi_rank % grid_p;
106      int mycol = mpi_rank / grid_p;
107      int64_t mlocal = slate::num_local_rows_cols( m, nb, myrow, 0, grid_p );
108      int64_t nlocal = slate::num_local_rows_cols( n, nb, mycol, 0, grid_q );
109      int64_t lld = mlocal; // local leading dimension
110      scalar_type* A_data = new scalar_type[ lld*nlocal ];
111
112      // lambda to get tile (i, j) in ScaLAPACK data.
113      auto data = [A_data, nb, lld]( int64_t i, int64_t j ) {
114          int64_t ii_local = slate::global2local( i*nb, nb, grid_p );
115          int64_t jj_local = slate::global2local( j*nb, nb, grid_q );
116          return &A_data[ ii_local + jj_local*lld ];
117      };
118
119      //---------- begin insert_user
120      // Create an empty matrix (2D block cyclic layout, no tiles).
121      slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
122
123      // Attach user allocated tiles, from pointers in data( i, j )
124      // with local stride lld between columns.
125      for (int64_t j = 0; j < A.nt(); ++j) {
126          for (int64_t i = 0; i < A.mt(); ++i) {
127              if (A.tileIsLocal( i, j ))
128                  A.tileInsert( i, j, data( i, j ), lld );
129          }
130      }
131      //---------- end insert_user
132  }
133
134  //------------------------------------------------------------------------------
135  template <typename scalar_type>
136  void test_fromScaLAPACK()
137  {
138      print_func( mpi_rank );
139
140      int64_t m=2000, n=1000, nb=256;
141
142      //---------- begin fromScaLAPACK
143      // User-allocated data, in ScaLAPACK format (assuming column-major grid).
144      int myrow = mpi_rank % grid_p;
145      int mycol = mpi_rank / grid_p;
146      int64_t mlocal = slate::num_local_rows_cols( m, nb, myrow, 0, grid_p );
147      int64_t nlocal = slate::num_local_rows_cols( n, nb, mycol, 0, grid_q );
148      int64_t lld = mlocal; // local leading dimension
149      scalar_type* A_data = new scalar_type[ lld*nlocal ];
150
151      // Create matrix from ScaLAPACK data.
152      auto A = slate::Matrix<scalar_type>::fromScaLAPACK(
153          m, n,                   // global matrix dimensions
154          A_data,                 // local ScaLAPACK array data
155          lld,                    // local leading dimension (column stride) for data
156          nb, nb,                 // block size
157          slate::GridOrder::Col,  // col- or row-major MPI process grid
158          grid_p, grid_q,         // MPI process grid
159          MPI_COMM_WORLD          // MPI communicator
160      );
161      //---------- end fromScaLAPACK
162  }
163
164  //------------------------------------------------------------------------------
165  template <typename scalar_type>
166  void test_transpose()
167  {
168      print_func( mpi_rank );
169
170      int64_t m=2000, n=1000, nb=256;
171
172      //---------- begin transpose
173      slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
174
175      // Transpose
176      // AT is a transposed view of A, with flag AT.op() == Op::Trans.
177      // The Tile AT( i, j ) == transpose( A( j, i ) ).
178      auto AT = transpose( A );
179
180      // Conjugate transpose
181      // AH is a conjugate-transposed view of A, with flag AH.op() == Op::ConjTrans.
182      // The Tile AH( i, j ) == conj_transpose( A( j, i ) ).
183      auto AH = conj_transpose( A );
184      //---------- end transpose
185  }
186
187  //------------------------------------------------------------------------------
188  template <typename scalar_type>
189  void test_elements()
190  {
191      using slate::LayoutConvert;
192
193      print_func( mpi_rank );
194
195      int64_t m=2000, n=1000, nb=256;
196
197      //---------- begin elements
198      slate::Matrix<scalar_type> A( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
199      A.insertLocalTiles( slate::Target::Host );
200
201      // Loop over tiles in A.
202      int64_t jj_global = 0;
203      for (int64_t j = 0; j < A.nt(); ++j) {
204          int64_t ii_global = 0;
205          for (int64_t i = 0; i < A.mt(); ++i) {
206              if (A.tileIsLocal( i, j )) {
207                  // For local tiles, loop over entries in tile.
208                  // Make sure CPU tile exists for writing.
209                  A.tileGetForWriting( i, j, slate::HostNum, LayoutConvert::ColMajor );
210                  slate::Tile<scalar_type> T = A( i, j, slate::HostNum );
211                  for (int64_t jj = 0; jj < T.nb(); ++jj) {
212                      for (int64_t ii = 0; ii < T.mb(); ++ii) {
213                          // Note: currently using T.at() is inefficient
214                          // in inner loops; see below.
215                          T.at( ii, jj )
216                              = std::abs( (ii_global + ii) - (jj_global + jj) );
217                      }
218                  }
219              }
220              ii_global += A.tileMb( i );
221          }
222          jj_global += A.tileNb( j );
223      }
224      //---------- end elements
225
226      //---------- begin elements2
227      // Loop over tiles in A, more efficient implementation.
228      jj_global = 0;
229      for (int64_t j = 0; j < A.nt(); ++j) {
230          int64_t ii_global = 0;
231          for (int64_t i = 0; i < A.mt(); ++i) {
232              if (A.tileIsLocal( i, j )) {
233                  // For local tiles, loop over entries in tile.
234                  // Make sure CPU tile exists for writing.
235                  A.tileGetForWriting( i, j, slate::HostNum, LayoutConvert::ColMajor );
236                  slate::Tile<scalar_type> T = A( i, j, slate::HostNum );
237                  scalar_type* data = T.data();
238                  int64_t mb = T.mb();
239                  int64_t nb = T.nb();
240                  int64_t stride = T.stride();
241                  for (int64_t jj = 0; jj < nb; ++jj) {
242                      for (int64_t ii = 0; ii < mb; ++ii) {
243                          // Currently more efficient than using T.at().
244                          data[ ii + jj*stride ]
245                              = std::abs( (ii_global + ii) - (jj_global + jj) );
246                      }
247                  }
248              }
249              ii_global += A.tileMb( i );
250          }
251          jj_global += A.tileNb( j );
252      }
253      //---------- end elements2
254  }
255
256  //------------------------------------------------------------------------------
257  // Map double => float, and complex<double> => complex<float>
258  template <typename T>
259  struct mixed_precision_traits;
260
261  template <>
262  struct mixed_precision_traits< double >
263  {
264      using low_type = float;
265  };
266
267  template <typename T>
268  struct mixed_precision_traits< std::complex<T> >
269  {
270      using real_low_type = typename mixed_precision_traits<T>::low_type;
271      using low_type = std::complex< real_low_type >;
272  };
273
274  //------------------------------------------------------------------------------
275  template <typename scalar_type>
276  void test_copy()
277  {
278      // low_type is float or complex<float>.
279      using low_type = typename mixed_precision_traits< scalar_type >::low_type;
280
281      print_func( mpi_rank );
282
283      int64_t m=2000, n=1000, nb=256;
284
285      //---------- begin copy
286      // scalar_type is double or complex<double>;
287      // low_type is float or complex<float>.
288      slate::Matrix<scalar_type> A_hi( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
289      slate::Matrix<low_type> A_lo( m, n, nb, grid_p, grid_q, MPI_COMM_WORLD );
290      A_hi.insertLocalTiles();
291      A_lo.insertLocalTiles();
292
293      auto A_hi_2 = A_hi.emptyLike();
294      A_hi_2.insertLocalTiles();
295
296      // Copy with precision conversion from double to float.
297      copy( A_hi, A_lo );
298
299      // Copy with precision conversion from float to double.
300      copy( A_lo, A_hi );
301
302      // Copy without conversion.
303      copy( A_hi, A_hi_2 );
304      //---------- end copy
305  }
306
307  //------------------------------------------------------------------------------
308  template <typename scalar_type>
309  void test_all()
310  {
311      test_constructor   <scalar_type>();
312      test_insert_host   <scalar_type>();
313      test_insert_device <scalar_type>();
314      test_insert_user   <scalar_type>();
315      test_fromScaLAPACK <scalar_type>();
316      test_elements      <scalar_type>();
317  }
318
319  //------------------------------------------------------------------------------
320  int main( int argc, char** argv )
321  {
322      try {
323          // Parse command line to set types for s, d, c, z precisions.
324          bool types[ 4 ];
325          parse_args( argc, argv, types );
326
327          int provided = 0;
328          slate_mpi_call(
329              MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided ) );
330          assert( provided == MPI_THREAD_MULTIPLE );
331
332          slate_mpi_call(
333              MPI_Comm_size( MPI_COMM_WORLD, &mpi_size ) );
334
335          slate_mpi_call(
336              MPI_Comm_rank( MPI_COMM_WORLD, &mpi_rank ) );
337
338          // Determine p-by-q grid for this MPI size.
339          grid_size( mpi_size, &grid_p, &grid_q );
340          if (mpi_rank == 0) {
341              printf( "mpi_size %d, grid_p %d, grid_q %d\n",
342                      mpi_size, grid_p, grid_q );
343          }
344
345          // Seed so random_matrix is different on different ranks.
346          srand( 100 * mpi_rank );
347
348          if (types[ 0 ]) {
349              test_all< float >();
350          }
351
352          if (types[ 1 ]) {
353              test_all< double >();
354              test_copy< double >();
355          }
356
357          if (types[ 2 ]) {
358              test_all< std::complex<float> >();
359          }
360
361          if (types[ 3 ]) {
362              test_all< std::complex<double> >();
363              test_copy< std::complex<double> >();
364          }
365
366          slate_mpi_call(
367              MPI_Finalize() );
368      }
369      catch (std::exception const& ex) {
370          fprintf( stderr, "%s", ex.what() );
371          return 1;
372      }
373      return 0;
374  }