Bibliography

[1] Mark Gates, Ali Charara, Jakub Kurzak, Asim YarKhan, Mohammed Al Farhan, Dalal Sukkari, and Jack Dongarra. SLATE users’ guide, SWAN no. 10. Technical Report ICL-UT-19-01, Innovative Computing Laboratory, University of Tennessee, July 2020. URL https://www.icl.utk.edu/publications/swan-010. revision 07-2020.

[2] Mark Gates, Ali Charara, Asim YarKhan, Dalal Sukkari, Mohammed Al Farhan, and Jack Dongarra. SLATE working note 14: Performance tuning SLATE. Technical Report ICL-UT-20-01, Innovative Computing Laboratory, University of Tennessee, December 2019. URL https://www.icl.utk.edu/publications/swan-014. revision 12-2019.

[3] H. Carter Edwards, Bryce Adelstein Lelbach, Daniel Sunderland, David Hollman, Christian Trott, Mauro Bianco, Ben Sander, Athanasios Iliopoulos, and John Michopoulos. P0009r7: mdspan: A Non-Owning Multidimensional Array Reference. ISO, 2018. URL http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0009r7.html.

[4] Fred Gustavson, André Henriksson, Isak Jonsson, Bo Kågström, and Per Ling. Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. Applied Parallel Computing Large Scale Scientific and Industrial Problems, 1541:195–206, 1998. doi:10.1007/BFb0095337.

[5] Fred G Gustavson, Jerzy Waśniewski, Jack J Dongarra, and Julien Langou. Rectangular full packed format for Cholesky’s algorithm: factorization, solution, and inversion. ACM Transactions on Mathematical Software (TOMS), 37(2):18, 2010. doi:10.1145/1731022.1731028.

[6] Introducing the new Packed APIs for GEMM. Intel Corp., 2016. URL https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm.

[7] Fred Gustavson, Lars Karlsson, and Bo Kågström. Parallel and cache-efficient in-place matrix storage format conversion. ACM Transactions on Mathematical Software (TOMS), 38(3):17, 2012. doi:10.1145/2168773.2168775.

[8] Stefan Kurz, Oliver Rain, and Sergej Rjasanow. The adaptive cross-approximation technique for the 3D boundary-element method. IEEE Transactions on Magnetics, 38(2):421–424, 2002. doi:10.1109/20.996112.

[9] Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Panruo Wu, Ichitaro Yamazaki, Asim YarKhan, Maksims Abalenkovs, Negin Bagherpour, Sven Hammarling, Jakub Šístek, David Stevens, Mawussi Zounon, and Samuel D. Relton. PLASMA: Parallel linear algebra software for multicore using OpenMP. ACM Transactions on Mathematical Software (TOMS), 45:16:1–16:35, 2019. doi:10.1145/3264491.

[10] Mark Gates, Piotr Luszczek, Ahmad Abdelfattah, Jakub Kurzak, Jack Dongarra, Konstantin Arturov, Cris Cecka, and Chip Freitag. C++ API for BLAS and LAPACK. Technical Report ICL-UT-17-03, SLATE Working Note 2, Innovative Computing Laboratory, University of Tennessee, June 2017. URL https://www.icl.utk.edu/publications/swan-002.

[11] Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr Luszczek, and Jakub Kurzak. Mixed precision iterative refinement techniques for the solution of dense linear systems. The International Journal of High Performance Computing Applications, 21(4):457–466, 2007. doi:10.1177/1094342007084026.

[12] Erin Carson and Nicholas J Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM Journal on Scientific Computing, 40(2):A817–A847, 2018. doi:10.1137/17M1140819.

[13] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, page 47. IEEE Press, 2018. doi:10.1109/SC.2018.00050.

[14] Yozo Hida, Xiaoye S Li, and David H Bailey. Algorithms for quad-double precision floating point arithmetic. In Proceedings 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001, pages 155–162. IEEE, 2001. doi:10.1109/ARITH.2001.930115.

[15] Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, and Jack Dongarra. Hierarchical DAG scheduling for hybrid distributed systems. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 156–165. IEEE, 2015. doi:10.1109/IPDPS.2015.56.

[16] Jakub Kurzak, Piotr Luszczek, Ichitaro Yamazaki, Yves Robert, and Jack Dongarra. Design and implementation of the PULSAR programming system for large scale computing. Supercomputing Frontiers and Innovations, 4(1):4–26, 2017. doi:10.14529/jsfi170101.