Chapter 1. Introduction to CUDA Programming
- CPU architecture is optimized for low-latency access, while GPU architecture is optimized for data-parallel throughput computation.
- The CPU hides data latency by keeping frequently used data in caches, exploiting temporal locality.
- In CUDA, the execution unit is a warp, not a thread. Context switching happens between warps, not threads.
- The GPU has lots of registers, so each thread's context-switching information is already present in registers (no context-switching overhead, unlike the CPU).
- Host code vs. Device code.
- Host memory vs. Device memory
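A minimal sketch of the host memory / device memory split, assuming a hypothetical float array of size N; `cudaMalloc`, `cudaMemcpy`, and `cudaFree` are the CUDA runtime calls, while the names and sizes are illustrative:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    // Host memory: ordinary CPU-side allocation
    float *h_a = (float *)malloc(bytes);

    // Device memory: allocated on the GPU via the CUDA runtime
    float *d_a = nullptr;
    cudaMalloc(&d_a, bytes);

    // Copy host -> device before a kernel runs, device -> host afterwards
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```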
- The return type of a kernel (a `__global__` function) is always `void`.
- Data-parallel portions of an algorithm that are executed on the device are called kernels.
- All kernel launches in CUDA are asynchronous in nature; the host needs to wait for the device to finish, e.g. with `cudaDeviceSynchronize()` (see the sketch below).
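A minimal sketch of an asynchronous launch followed by host-side synchronization; the kernel name `fill` and its contents are hypothetical:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: one block of threads, each thread writes one element
__global__ void fill(float *out, float value) {
    out[threadIdx.x] = value;
}

int main() {
    const int N = 256;
    float *d_out = nullptr;
    cudaMalloc(&d_out, N * sizeof(float));

    fill<<<1, N>>>(d_out, 1.0f);   // the launch returns to the host immediately

    cudaDeviceSynchronize();       // block the host until the device has finished

    cudaFree(d_out);
    return 0;
}
```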
- Software X runs on/as HW Y:
    - CUDA thread <-> CUDA core/SIMD lane
    - CUDA block <-> SM (streaming multiprocessor)
    - Grid/kernel <-> GPU device
- One block runs on a single SM; all threads within a block can only execute on the cores of that one SM.
- Launch configuration: `kernel<<< gridDim, blockDim >>>(...)`, i.e. the number of blocks in the grid and the number of threads per block (see the indexing sketch below).
- `blockIdx`, `threadIdx`: index (of the block within the grid, and of the thread within the block).
- `gridDim`, `blockDim`: dimension (== size); `blockDim` is the number of threads per block.
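A minimal indexing sketch under these definitions; the kernel name `vector_add`, the array names, and the launch sizes are hypothetical:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: compute a global index from block and thread indices
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // blockIdx.x  : which block this thread belongs to   (0 .. gridDim.x  - 1)
    // threadIdx.x : position of the thread in its block  (0 .. blockDim.x - 1)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // guard: the grid may cover more threads than n
        c[i] = a[i] + b[i];
    }
}

// Host-side launch: round the number of blocks up so every element is covered
// int threadsPerBlock = 256;
// int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// vector_add<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
```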
- Threads have mechanisms to communicate and synchronize efficiently.
- The CUDA programming model allows this communication only for threads within the same block.
- Threads in the same block communicate with each other through a special memory called shared memory (sketched below).
- Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of a kernel.
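A minimal sketch of intra-block communication through shared memory, assuming a hypothetical kernel that reverses one block's worth of data; `__shared__` and `__syncthreads()` are the real CUDA constructs, the rest is illustrative:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: threads in one block exchange data through shared memory
__global__ void reverse_block(float *data) {
    __shared__ float tile[256];          // visible to all threads of this block

    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread stages one element

    __syncthreads();                     // barrier: wait until the whole block has written

    data[t] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}

// Launch with exactly one block of 256 threads, e.g.:
// reverse_block<<<1, 256>>>(d_data);
```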
- Error checking: `cudaError_t e = cudaGetLastError();`. Even if multiple errors have occurred, only the last one is returned.
- Typical pattern: `a<<< , >>>(...); cudaDeviceSynchronize(); e = cudaGetLastError();` (see the fuller sketch below).
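A fuller sketch of that pattern; the kernel `a` and its launch parameters stand in for whatever was actually launched:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for any launch we want to check
__global__ void a(int *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(int));

    a<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();              // make sure the kernel has actually run

    // Only the most recent error is reported; the error state is then cleared
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(e));
    }

    cudaFree(d_out);
    return 0;
}
```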
Backlink
[[Learn CUDA Programming]]
Date
Oct 19, 2020 4:03 PM
#cuda #book/learn-cuda-programming