CUDA Programming: A Developer's Guide to Parallel Computing with GPUs (Applications of Gpu Computing)

Shane Cook

Language: English

Pages: 600

ISBN: 0124159338

Format: PDF / Kindle (mobi) / ePub

If you need to learn CUDA but don't have experience with parallel computing, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs offers a detailed guide to CUDA with a grounding in parallel fundamentals. It starts by introducing CUDA and bringing you up to speed on GPU parallelism and hardware, then delves into CUDA installation. Chapters on core concepts including threads, blocks, grids, and memory focus on both parallel and CUDA-specific issues. Later, the book demonstrates CUDA in practice for optimizing applications, adjusting to new hardware, and solving common problems.

  • Comprehensive introduction to parallel programming with CUDA, for readers new to both
  • Detailed instructions help readers optimize the CUDA software development kit
  • Practical techniques illustrate working with memory, threads, algorithms, resources, and more
  • Covers CUDA on multiple hardware platforms: Mac, Linux, and Windows, with several NVIDIA chipsets
  • Each chapter includes exercises to test reader knowledge

that are becoming progressively obsolete. GPUs are changing the face of computing. All of a sudden the computing power of supercomputers from a decade ago can be slotted under your desk. No longer must you wait in a queue to submit work batches and wait months for a committee to approve your request to use limited computer resources at overstretched computing installations. You can go out, spend 5,000 to 10,000 USD, and have a supercomputer on your desk, or a development machine that runs CUDA

node or box. Thus, you are limited to problems that fit within a single box in terms of processing power, memory capacity, and storage space. Programming, however, is relatively easy as most of the low-level threading code (otherwise written using Windows threads or POSIX threads) is taken care of for you by OpenMP. The MPI (Message Passing Interface) standard addresses parallelism between nodes and is aimed at clusters of machines within well-defined networks. It is often used in supercomputer

would simply branch on the thread ID, as follows: if ((thread_idx % 32) < 16) { action_a(); } else { action_b(); } The modulus operator in C (%) returns the remainder after integer division of its operands. In effect, you count from 0 to 31 and then loop back to 0 again. Ideally, the function action_a() has each of its 16 threads access a single float or integer value. This causes a single 64-byte memory fetch. The following half warp does the same and thus you issue a single 128-byte memory

known and fixed. Consequently, the GPU can allocate a fixed set of registers for each block scheduled onto the hardware. At a thread level, this is transparent to the programmer. However, a kernel that requests too many registers per thread can limit the number of blocks the GPU can schedule on an SM, and thus the total number of threads that will be run. Too few threads and you start underutilizing the hardware and the performance starts to rapidly drop off. Too many threads can mean you run

base_cnt_1++; } else { cpu_tmp_0[base_cnt_0] = d; base_cnt_0++; } } // Copy data back to source - first the zero list for (u32 i=0; i
