“Book”

2. CUDA Memory Management Most of the application’s performance will be bottlecked by memory-related constraints GPU RAM BW : 900GB/s (DDR3 ?) NV Visual Profiler Global memory is a staging area where all of the date gets copied from CPU memory. Global Memory(device memory) is visible to all of the threads in the kernel and also visible to CPu. Coalesced vs. uncoalesced global memory access coalesced global memory access : Sequential memory access is adjacent Warp Warp is a unit of thread scheduling/execution in SMs.

1. Introduction to CUDA Programming CPU Architecture is optimized for low latency accessing while GPU architecture is optimized for data parallel throughput compution CPU hides latency of data by frequently stroring used data in caches and utilize the temporal locality In CUDA, the execution unit is a warp not a thread. Context switching is happens between the warps and not threads. GPU has lots of registers, all the thread context switching information is already present in the registers.

Chapter 2. CUDA Memory Management

Chapter 1. Introduction to CUDA Programming