Title:
CUDA Fortran for Scientists and Engineers : Best Practices for Efficient CUDA Fortran Programming.
Author:
Ruetsch, Gregory.
ISBN:
9780124169722
Physical Description:
1 online resource (339 pages)
Contents:
Half Title -- Title Page -- Copyright -- Dedication -- Contents -- Acknowledgments -- Preface -- PART I: CUDA Fortran Programming -- 1 Introduction -- 1.1 A Brief History of GPU Computing -- 1.2 Parallel computation -- 1.3 Basic Concepts -- 1.3.1 A first CUDA Fortran program -- 1.3.2 Extending to larger arrays -- 1.3.3 Multidimensional arrays -- 1.4 Determining CUDA Hardware Features and Limits -- 1.4.1 Single and double precision -- 1.4.1.1 Accommodating variable precision -- 1.5 Error Handling -- 1.6 Compiling CUDA Fortran Code -- 1.6.1 Separate compilation -- 2 Performance Measurement and Metrics -- 2.1 Measuring kernel execution time -- 2.1.1 Host-device synchronization and CPU timers -- 2.1.2 Timing via CUDA events -- 2.1.3 Command Line Profiler -- 2.1.4 The nvprof profiling tool -- 2.2 Instruction, bandwidth, and latency bound kernels -- 2.3 Memory bandwidth -- 2.3.1 Theoretical peak bandwidth -- 2.3.2 Effective bandwidth -- 2.3.3 Actual data throughput vs. effective bandwidth -- 3 Optimization -- 3.1 Transfers between host and device -- 3.1.1 Pinned memory -- 3.1.2 Batching small data transfers -- 3.1.2.1 Explicit transfers using cudaMemcpy() -- 3.1.3 Asynchronous data transfers (advanced topic) -- 3.1.3.1 Hyper-Q -- 3.1.3.2 Profiling asynchronous events -- 3.2 Device memory -- 3.2.1 Declaring data in device code -- 3.2.2 Coalesced access to global memory -- 3.2.2.1 Misaligned access -- 3.2.2.2 Strided access -- 3.2.3 Texture memory -- 3.2.4 Local memory -- 3.2.4.1 Detecting local memory use (advanced topic) -- 3.2.5 Constant memory -- 3.2.5.1 Detecting constant memory use (advanced topic) -- 3.3 On-chip memory -- 3.3.1 L1 cache -- 3.3.2 Registers -- 3.3.3 Shared memory -- 3.3.3.1 Detecting shared memory usage (advanced topic) -- 3.3.3.2 Shared memory bank conflicts -- 3.4 Memory optimization example: matrix transpose.

3.4.1 Partition camping (advanced topic) -- 3.4.1.1 Diagonal reordering -- 3.5 Execution configuration -- 3.5.1 Thread-level parallelism -- 3.5.1.1 Shared memory -- 3.5.2 Instruction-level parallelism -- 3.6 Instruction optimization -- 3.6.1 Device intrinsics -- 3.6.1.1 Directed rounding -- 3.6.1.2 C intrinsics -- 3.6.1.3 Fast math intrinsics -- 3.6.2 Compiler options -- 3.6.3 Divergent warps -- 3.7 Kernel loop directives -- 3.7.1 Reductions in CUF kernels -- 3.7.2 Streams in CUF kernels -- 3.7.3 Instruction-level parallelism in CUF kernels -- 4 Multi-GPU Programming -- 4.1 CUDA multi-GPU features -- 4.1.1 Peer-to-peer communication -- 4.1.1.1 Requirements for peer-to-peer communication -- 4.1.2 Peer-to-peer direct transfers -- 4.1.3 Peer-to-peer transpose -- 4.2 Multi-GPU Programming with MPI -- 4.2.1 Assigning devices to MPI ranks -- 4.2.2 MPI transpose -- 4.2.3 GPU-aware MPI transpose -- PART II: Case Studies -- 5 Monte Carlo Method -- 5.1 CURAND -- 5.2 Computing π with CUF kernels -- 5.2.1 IEEE-754 precision (advanced topic) -- 5.3 Computing π with reduction kernels -- 5.3.1 Reductions with atomic locks (advanced topic) -- 5.4 Accuracy of summation -- 5.5 Option pricing -- 6 Finite Difference Method -- 6.1 Nine-Point 1D finite difference stencil -- 6.1.1 Data reuse and shared memory -- 6.1.2 The x-derivative kernel -- 6.1.2.1 Performance of the x-derivative kernel -- 6.1.3 Derivatives in y and z -- 6.1.3.1 Leveraging transpose -- 6.1.4 Nonuniform grids -- 6.2 2D Laplace equation -- 7 Applications of Fast Fourier Transform -- 7.1 CUFFT -- 7.2 Spectral derivatives -- 7.3 Convolution -- 7.4 Poisson Solver -- PART III: Appendices -- Appendix A: Tesla Specifications -- Appendix B: System and Environment Management -- B.1 Environment variables -- B.1.1 General -- B.1.2 Command Line Profiler -- B.1.3 Just-in-time compilation.

B.2 nvidia-smi System Management Interface -- B.2.1 Enabling and disabling ECC -- B.2.2 Compute mode -- B.2.3 Persistence mode -- Appendix C: Calling CUDA C from CUDA Fortran -- C.1 Calling CUDA C libraries -- C.2 Calling User-Written CUDA C Code -- Appendix D: Source Code -- D.1 Texture memory -- D.2 Matrix transpose -- D.3 Thread- and instruction-level parallelism -- D.4 Multi-GPU programming -- D.4.1 Peer-to-peer transpose -- D.4.2 MPI transpose with host MPI transfers -- D.4.3 MPI transpose with device MPI transfers -- D.5 Finite difference code -- D.6 Spectral Poisson Solver -- References -- Index -- A -- B -- C -- D -- E -- F -- G -- H -- I -- J -- K -- L -- M -- N -- O -- P -- R -- S -- T -- U -- V -- W.
Abstract:
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience and cover the basics along with best practices for efficient GPU computing using CUDA Fortran. To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code.
- Leverage the power of GPU computing with PGI's CUDA Fortran compiler
- Gain insights from members of the CUDA Fortran language development team
- Includes multi-GPU programming in CUDA Fortran, covering both peer-to-peer and message passing interface (MPI) approaches
- Includes full source code for all the examples and several case studies
- Download source code and slides from the book's companion website
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
Added Author:
Fatica, Massimiliano.