High Performance Computing: Programming and Applications (Chapman & Hall/CRC Computational Science)

Language: English

Pages: 244

ISBN: 1420077058

Format: PDF / Kindle (mobi) / ePub

High Performance Computing: Programming and Applications presents techniques that address new performance issues in the programming of high performance computing (HPC) applications. Omitting tedious details, the book discusses hardware architecture concepts and programming techniques that are the most pertinent to application developers for achieving high performance. Even though the text concentrates on C and Fortran, the techniques described can be applied to other languages, such as C++ and Java.

Drawing on their experience with chips from AMD and systems, interconnects, and software from Cray Inc., the authors explore the problems that create bottlenecks in attaining good performance. They cover techniques that pertain to each of the three levels of parallelism:

  • Message passing between the nodes
  • Shared memory parallelism on the nodes or the multiple instruction, multiple data (MIMD) units on the accelerator
  • Vectorization on the inner level

After discussing architectural and software challenges, the book outlines a strategy for porting and optimizing an existing application to a large massively parallel processor (MPP) system. With a look toward the future, it also introduces the use of general purpose graphics processing units (GPGPUs) for carrying out HPC computations. A companion website at www.hybridmulticoreoptimization.com contains all the examples from the book, along with updated timing results on the latest released processors.

Understanding and Applying Machine Vision (2nd Edition) (Manufacturing Engineering and Materials Processing)

Introduction to Theoretical Computer Science (Series in Computer Science, Volume 23)

Genetic Programming Theory and Practice VII (Genetic and Evolutionary Computation)

Windows Developer Power Tools: Turbocharge Windows Development with more than 170 free tools

Formal Languages and Compilation (2nd Edition) (Texts in Computer Science)

Software Testing using Visual Studio 2012

the grid. When using collectives to perform the transposes, nothing can be overlapped with the transpose. If the 3D FFT were written to use point-to-point communication, the communication could be overlapped with the execution of the FFT. This is not an easy task; however, it could result in the communication completely hiding the computation of the FFT. Another powerful technique is combining collectives whenever possible. Oftentimes, numerous global reduces are performed at each iteration

receive when an MPI_SENDRECV is used; however, this is not a long enough time to prepost the message. Since the MPI tasks are operating independently, the likelihood of a processor receiving a message from another processor before it executes the MPI_SENDRECV is very high. Ideally, the receive is posted as soon as the buffer to contain the message is available. In reviewing the point-to-point communication, the following questions should be asked: 1. What is the earliest time I can

          DO 41033 I = 1, N
            Y(I) = c0 + X(I)*(C1 + X(I)*(C2 + X(I)*(C3 + X(I))))
    41033 CONTINUE

But the next loop indexes the array indirectly through another array, and would require gather/scatter instructions to make vectorization possible. Most compilers would not vectorize this loop because of the overhead:

          DO 41012 I = 1, N
            Y(IY(I)) = c0 + X(IX(I))*(C1 + X(IX(I))*(C2 + X(IX(I))))
    41012 CONTINUE

Indexing arrays with a stride, as in the following loop, would also require
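The contrast between the two loops in this excerpt can be sketched in C. This is only an illustrative translation, not code from the book: the function names `poly_direct` and `poly_indirect`, the zero-based index arrays, and the argument order are assumptions made for the example. The first function walks `x` and `y` with unit stride; the second reads `x` through `ix` (a gather) and writes `y` through `iy` (a scatter).

```c
#include <assert.h>
#include <stddef.h>

/* Unit-stride version of loop 41033: x and y are read and written
   contiguously, which is the easy case for a vectorizing compiler. */
void poly_direct(size_t n, const double *x, double *y,
                 double c0, double c1, double c2, double c3)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        y[i] = c0 + x[i] * (c1 + x[i] * (c2 + x[i] * (c3 + x[i])));
    }
}

/* Indirect version of loop 41012: the load x[ix[i]] is a gather and the
   store y[iy[i]] is a scatter. Indices are zero-based here (an assumption;
   the Fortran original is one-based). */
void poly_indirect(size_t n, const double *x, double *y,
                   const size_t *ix, const size_t *iy,
                   double c0, double c1, double c2)
{
    assert(x != NULL && y != NULL && ix != NULL && iy != NULL);
    for (size_t i = 0; i < n; i++) {
        double xv = x[ix[i]];               /* gathered load */
        y[iy[i]] = c0 + xv * (c1 + xv * (c2 + xv)); /* scattered store */
    }
}
```

If `iy` contained repeated indices, the result of the scattered store would depend on iteration order, which is one more reason a compiler may be reluctant to vectorize the second form.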

that accesses C contiguously is not very long.

          DO 45020 I = 1, N
            F(I) = A(I) + .5
            DO 45020 J = 1, 10
              D(I,J) = B(J) * F(I)
              DO 45020 K = 1, 5
                C(K,I,J) = D(I,J) * E(K)
    45020 CONTINUE

If we take the approach of pulling the longest loop on the inside, then we have a stride on the C array. Since the K index is short, and I is the second index, it will still make good utilization of the cache. If the first dimension of C is 5, then 5*N contiguous
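One way to picture "pulling the longest loop on the inside" is the C sketch below. It is an assumed restructuring, not the book's own rewritten loop, which the excerpt does not show. The names `nest_original` and `nest_long_inner` are hypothetical, and the flat arrays mimic the Fortran column-major layout: `c[(j*n + i)*NK + k]` stands in for `C(K,I,J)` and `d[j*n + i]` for `D(I,J)`.

```c
#include <assert.h>
#include <stddef.h>

#define NJ 10  /* trip count of the J loop in the excerpt */
#define NK 5   /* trip count of the K loop in the excerpt */

/* Original ordering: I outermost, short K loop innermost.
   c is contiguous in k, but the inner loop is only 5 iterations long. */
void nest_original(size_t n, const double *a, const double *b,
                   const double *e, double *f, double *d, double *c)
{
    assert(a && b && e && f && d && c);
    for (size_t i = 0; i < n; i++) {
        f[i] = a[i] + 0.5;
        for (size_t j = 0; j < NJ; j++) {
            d[j * n + i] = b[j] * f[i];
            for (size_t k = 0; k < NK; k++)
                c[(j * n + i) * NK + k] = d[j * n + i] * e[k];
        }
    }
}

/* Reordered: the long I loop is innermost everywhere. f and d are now
   unit stride, while c is written with a stride of NK elements. */
void nest_long_inner(size_t n, const double *a, const double *b,
                     const double *e, double *f, double *d, double *c)
{
    assert(a && b && e && f && d && c);
    for (size_t i = 0; i < n; i++)
        f[i] = a[i] + 0.5;
    for (size_t j = 0; j < NJ; j++) {
        for (size_t i = 0; i < n; i++)
            d[j * n + i] = b[j] * f[i];
        for (size_t k = 0; k < NK; k++)
            for (size_t i = 0; i < n; i++)
                c[(j * n + i) * NK + k] = d[j * n + i] * e[k];
    }
}
```

Both functions perform the same arithmetic on each element, so they should fill `c` identically; only the order of memory accesses changes. Whether the strided form is faster depends on the compiler and the cache, so it is worth timing both.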

hold the incoming message. These arrays are not in use and are not needed until the message is received. Now the nine-point stencil is performed by each processor and once this computation is complete, the buffers for the east–west messages are packed into a contiguous section of memory. Once this is completed, the sends are made to the east–west neighbor processors. At this time, the processor must wait for two of the receives to be completed. Now the
