High Performance Computing: Programming and Applications (Chapman & Hall/CRC Computational Science)
High Performance Computing: Programming and Applications presents techniques that address new performance issues in the programming of high performance computing (HPC) applications. Omitting tedious details, the book discusses hardware architecture concepts and programming techniques that are the most pertinent to application developers for achieving high performance. Even though the text concentrates on C and Fortran, the techniques described can be applied to other languages, such as C++ and Java.
Drawing on their experience with chips from AMD, as well as systems, interconnects, and software from Cray Inc., the authors explore the problems that create bottlenecks in attaining good performance. They cover techniques that pertain to each of the three levels of parallelism:
- Message passing between the nodes
- Shared memory parallelism on the nodes or the multiple instruction, multiple data (MIMD) units on the accelerator
- Vectorization on the inner level
After discussing architectural and software challenges, the book outlines a strategy for porting and optimizing an existing application to a large massively parallel processor (MPP) system. With a look toward the future, it also introduces the use of general purpose graphics processing units (GPGPUs) for carrying out HPC computations. A companion website at www.hybridmulticoreoptimization.com contains all the examples from the book, along with updated timing results on the latest released processors.
the grid. When using collectives to perform the transposes, nothing can be overlapped with the transpose. If the 3D FFT were written to use point-to-point communication, the communication could be overlapped with the execution of the FFT. This is not an easy task; however, it could result in the computation of the FFT completely hiding the communication. Another powerful technique is combining collectives whenever possible. Oftentimes, numerous global reductions are performed at each iteration
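The combining technique can be sketched in C with MPI (a sketch only, with hypothetical variable names, not the book's code): instead of issuing one MPI_Allreduce per scalar, the per-iteration scalars are packed into a single buffer and reduced with one call, paying one collective latency instead of several.

```c
#include <mpi.h>

/* Sketch: replace three separate MPI_Allreduce calls on the scalars
 * err_norm, mass_sum, and energy_sum (hypothetical names) with a
 * single reduction over a packed buffer, saving two collective
 * latencies per iteration. */
void combined_reduce(double err_norm, double mass_sum, double energy_sum,
                     double global[3], MPI_Comm comm)
{
    double local[3];
    local[0] = err_norm;    /* pack all per-iteration scalars ...      */
    local[1] = mass_sum;
    local[2] = energy_sum;
    /* ... and reduce them in one collective instead of three */
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm);
}
```

This only works when the scalars use the same reduction operator; mixing MPI_SUM and MPI_MAX quantities still requires separate calls (or a user-defined operator).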
receive when an MPI_SENDRECV is used; however, this is not a long enough time to prepost the message. Since the MPI tasks are operating independently, the likelihood of a processor receiving a message from another processor before it executes the MPI_SENDRECV is very high. Ideally, the receive is posted as soon as the buffer to contain the message is available. In reviewing the point-to-point communication, the following questions should be asked: 1. What is the earliest time I can
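The preposting idea can be sketched as follows (a sketch with hypothetical buffer and neighbor names, not the book's code): the receive is posted as soon as its buffer is free, useful work proceeds while the message can arrive, and only then is the send issued and the receive completed.

```c
#include <mpi.h>

/* Sketch: post the receive as early as the buffer allows, instead of
 * letting MPI_Sendrecv post it only at the exchange point. */
void exchange(double *sendbuf, double *recvbuf, int n,
              int neighbor, MPI_Comm comm)
{
    MPI_Request req;

    /* 1. Post the receive as soon as recvbuf is no longer in use. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &req);

    /* 2. ... useful computation here, while the message can arrive
     *    directly into recvbuf rather than an internal MPI buffer ... */

    /* 3. Send when the data is ready, then complete the receive. */
    MPI_Send(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```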
```fortran
      DO 41033 I = 1, N
        Y(I) = C0 + X(I)*(C1 + X(I)*(C2 + X(I)*(C3 + X(I))))
41033 CONTINUE
```

But the next loop indexes the array indirectly through another array, and would require gather/scatter instructions to make vectorization possible. Most compilers would not vectorize this loop because of the overhead:

```fortran
      DO 41012 I = 1, N
        Y(IY(I)) = C0 + X(IX(I))*(C1 + X(IX(I))*(C2 + X(IX(I))))
41012 CONTINUE
```

Indexing arrays with a stride, as in the following loop, would also require
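The same two cases can be written in C (a sketch; the coefficient and index-array names are made up): the first loop is a straightforward Horner evaluation with unit-stride accesses that a compiler can vectorize, while the second forces gathers and scatters through the index arrays.

```c
#include <stddef.h>

/* Directly indexed loop: unit-stride loads and stores, vectorizable. */
void poly_direct(const double *x, double *y, size_t n,
                 double c0, double c1, double c2, double c3)
{
    for (size_t i = 0; i < n; i++)
        y[i] = c0 + x[i] * (c1 + x[i] * (c2 + x[i] * (c3 + x[i])));
}

/* Indirectly indexed loop: needs gather (x[ix[i]]) and scatter
 * (y[iy[i]]) instructions, so many compilers leave it scalar. */
void poly_indirect(const double *x, double *y,
                   const int *ix, const int *iy, size_t n,
                   double c0, double c1, double c2)
{
    for (size_t i = 0; i < n; i++)
        y[iy[i]] = c0 + x[ix[i]] * (c1 + x[ix[i]] * (c2 + x[ix[i]]));
}
```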
that accesses C contiguously is not very long.

```fortran
      DO 45020 I = 1, N
        F(I) = A(I) + .5
        DO 45020 J = 1, 10
          D(I,J) = B(J) * F(I)
          DO 45020 K = 1, 5
            C(K,I,J) = D(I,J) * E(K)
45020 CONTINUE
```

If we take the approach of pulling the longest loop on the inside, then we have a stride on the C array. Since the K index is short, and I is the second index, it will still make good utilization of the cache. If the first dimension of C is 5, then 5*N contiguous
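The access pattern can be illustrated in C (a sketch; Fortran stores C(K,I,J) column-major with K fastest, which the code mimics by explicit index arithmetic): with the short K loop innermost, each store to C lands immediately after the previous one.

```c
#define N  4   /* small N for illustration; the book's N is the long dimension */
#define NJ 10
#define NK 5

/* Mimic the Fortran arrays: C(K,I,J) is stored column-major, so
 * element (k,i,j) lives at offset k + NK*(i + N*j) and consecutive
 * k values are adjacent in memory, giving unit-stride stores in the
 * innermost loop. */
void kernel(const double *a, const double *b, const double *e,
            double *f, double *d, double *c)
{
    for (int i = 0; i < N; i++) {
        f[i] = a[i] + 0.5;
        for (int j = 0; j < NJ; j++) {
            d[i + N * j] = b[j] * f[i];           /* D(I,J)          */
            for (int k = 0; k < NK; k++)          /* unit stride in C */
                c[k + NK * (i + N * j)] = d[i + N * j] * e[k];
        }
    }
}
```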
hold the incoming message. These arrays are not in use and are not needed until the message is received. Now the nine-point stencil is performed by each processor and once this computation is complete, the buffers for the east–west messages are packed into a contiguous section of memory. Once this is completed, the sends are made to the east–west neighbor processors. At this time, the processor must wait for two of the receives to be completed. Now the
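The packing step can be sketched in C (a sketch with a hypothetical layout: a local nx-by-ny grid stored with the x index fastest, halo cells omitted for simplicity): the east boundary column is strided in memory, so it is copied into a contiguous buffer that a single send can then transmit.

```c
#include <stddef.h>

/* Sketch: with the x index fastest, the east boundary column
 * (i = nx-1 for every j) sits in memory with stride nx. Packing
 * copies it into a contiguous buffer suitable for one MPI send. */
void pack_east(const double *grid, size_t nx, size_t ny, double *buf)
{
    for (size_t j = 0; j < ny; j++)
        buf[j] = grid[(nx - 1) + nx * j];
}
```

The west boundary (i = 0) is packed the same way; the north–south boundaries are already contiguous in this layout and need no packing.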