Chapter 1. Introduction to CUDA Programming
- CPU architecture is optimized for low-latency access, while GPU architecture is optimized for data-parallel throughput computation.
- The CPU hides data latency by keeping frequently used data in caches, exploiting temporal locality.
- In CUDA, the execution unit is a warp, not a thread. Context switching happens between warps, not threads.
- The GPU has lots of registers, so each thread's context-switching information is already present in registers (no context-switching overhead, unlike the CPU).
- Host code vs. Device code.
- Host memory vs. Device memory
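A minimal sketch of the host memory / device memory split, assuming a hypothetical float array of size N; `cudaMalloc`, `cudaMemcpy`, and `cudaFree` are the CUDA runtime calls, while the names and sizes are illustrative:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    // Host memory: ordinary CPU-side allocation
    float *h_a = (float *)malloc(bytes);

    // Device memory: allocated on the GPU via the CUDA runtime
    float *d_a = nullptr;
    cudaMalloc(&d_a, bytes);

    // Copy host -> device before a kernel runs, device -> host afterwards
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```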
- The return type of a kernel (a `__global__` function) is always `void`.
- Data-parallel portions of an algorithm that are executed on the device are called kernels.
- All kernel launches in CUDA are asynchronous in nature; the host needs to wait for the device to finish, e.g. with `cudaDeviceSynchronize()` (see the sketch below).
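A minimal sketch of an asynchronous launch followed by host-side synchronization; the kernel name `fill` and its contents are hypothetical:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: one block of threads, each thread writes one element
__global__ void fill(float *out, float value) {
    out[threadIdx.x] = value;
}

int main() {
    const int N = 256;
    float *d_out = nullptr;
    cudaMalloc(&d_out, N * sizeof(float));

    fill<<<1, N>>>(d_out, 1.0f);   // the launch returns to the host immediately

    cudaDeviceSynchronize();       // block the host until the device has finished

    cudaFree(d_out);
    return 0;
}
```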
- Software X runs on/as HW Y:
    - CUDA thread <-> CUDA core/SIMD lane
    - CUDA block <-> SM (streaming multiprocessor)
    - Grid/kernel <-> GPU device
- One block runs on a single SM; all threads within a block can only execute on the cores of that one SM.
- Launch configuration: `kernel<<< gridDim, blockDim >>>(...)`, i.e. the number of blocks in the grid and the number of threads per block (see the indexing sketch below).
- `blockIdx`, `threadIdx`: index (of the block within the grid, and of the thread within the block).
- `gridDim`, `blockDim`: dimension (== size); `blockDim` is the number of threads per block.
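A minimal indexing sketch under these definitions; the kernel name `vector_add`, the array names, and the launch sizes are hypothetical:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: compute a global index from block and thread indices
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // blockIdx.x  : which block this thread belongs to   (0 .. gridDim.x  - 1)
    // threadIdx.x : position of the thread in its block  (0 .. blockDim.x - 1)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // guard: the grid may cover more threads than n
        c[i] = a[i] + b[i];
    }
}

// Host-side launch: round the number of blocks up so every element is covered
// int threadsPerBlock = 256;
// int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// vector_add<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
```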
- Threads have mechanisms to communicate and synchronize efficiently.
- The CUDA programming model allows this communication only for threads within the same block.
- Threads in the same block communicate with each other through a special memory called shared memory (sketched below).
- Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of a kernel.
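A minimal sketch of intra-block communication through shared memory, assuming a hypothetical kernel that reverses one block's worth of data; `__shared__` and `__syncthreads()` are the real CUDA constructs, the rest is illustrative:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: threads in one block exchange data through shared memory
__global__ void reverse_block(float *data) {
    __shared__ float tile[256];          // visible to all threads of this block

    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread stages one element

    __syncthreads();                     // barrier: wait until the whole block has written

    data[t] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}

// Launch with exactly one block of 256 threads, e.g.:
// reverse_block<<<1, 256>>>(d_data);
```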
- Error checking: `cudaError_t e = cudaGetLastError();`. Even if multiple errors have occurred, only the last one is returned.
- Typical pattern: `a<<< , >>>(...); cudaDeviceSynchronize(); e = cudaGetLastError();` (see the fuller sketch below).
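A fuller sketch of that pattern; the kernel `a` and its launch parameters stand in for whatever was actually launched:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for any launch we want to check
__global__ void a(int *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(int));

    a<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();              // make sure the kernel has actually run

    // Only the most recent error is reported; the error state is then cleared
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(e));
    }

    cudaFree(d_out);
    return 0;
}
```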
Backlink
[[Learn CUDA Programming]]
Date
Oct 19, 2020 4:03 PM
#cuda #book/learn-cuda-programming