An Introduction to Parallel Programming
Format: PDF / Kindle (mobi) / ePub
Author Peter Pacheco uses a tutorial approach to show students how to develop effective parallel programs with MPI, Pthreads, and OpenMP. The first undergraduate text to directly address compiling and running parallel programs on the new multi-core and cluster architecture, An Introduction to Parallel Programming explains how to design, debug, and evaluate the performance of distributed and shared-memory programs. User-friendly exercises teach students how to compile, run and modify example programs.
directive and a schedule(guided) clause, then when n = 10,000 and thread_count = 2, the iterations are assigned as shown in Table 5.3. We see that the size of the chunk is approximately the number of iterations remaining divided by the number of threads. The first chunk has size 9999/2 ≈ 5000, since there are 9999 unassigned iterations. The second chunk has size 4999/2 ≈ 2500, and so on.
Table 5.3. Assignment of Trapezoidal Rule Iterations 1–9999 using a guided Schedule with Two Threads
In a
for loop. What is the default assignment of iterations on your system? How are guided schedules determined? 5.10 Recall that all structured blocks modified by an unnamed critical directive form a single critical section. What happens if we have a number of atomic directives in which different variables are being modified? Are they all treated as a single critical section? We can write a small program that tries to determine this. The idea is to have all the threads simultaneously execute
When a single thread executes some code in the Pthreads version, the test if (my_rank == whatever) can be replaced by the OpenMP directive
# pragma omp single
This will ensure that the following structured block of code will be executed by one thread in the team, and the other threads in the team will wait in an implicit barrier at the end of the block until the executing thread is finished. When whatever is 0 (as it is in each test in the Pthreads program), the test can also be replaced by
x[i] += y[i]; In this setting, we must load each element of y into an ALU and determine whether it’s positive. If y[i] is positive, we can proceed to carry out the addition. Otherwise, the ALU storing y[i] will be idle while the other ALUs carry out the addition. Note also that in a “classical” SIMD system, the ALUs must operate synchronously, that is, each ALU must wait for the next instruction to be broadcast before proceeding. Further, the ALUs have no instruction storage, so an ALU can’t
square—it’s often the case that a program using matrix-vector multiplication will execute the multiplication many times and the result vector y from one multiplication will be the input vector x for the next iteration. In practice, then, we usually assume that the distribution for x is the same as the distribution for y. So if x has a block distribution, how can we arrange that each process has access to all the components of x before we execute the following loop? for (j = 0; j < n; j++) y[i]