Chapter 1. Introduction to CUDA Programming


  • CPU architecture is optimized for low-latency access, while GPU architecture is optimized for data-parallel throughput computation
  • The CPU hides data latency by keeping frequently used data in caches, exploiting temporal locality
  • In CUDA, the unit of execution is a warp, not a thread. Context switching happens between warps, not between threads.
  • The GPU has a large register file; each thread's context already lives in registers, so there is no context-switching overhead (unlike the CPU)
  • Host code vs. Device code.
  • Host memory vs. Device memory
  • The return type of a kernel function is always void.
  • The data-parallel portions of an algorithm are executed on the device as kernels.
  • All kernel launches in CUDA are asynchronous; the host must explicitly wait for the device to finish, e.g. with cudaDeviceSynchronize() (see the vector-add sketch after this list)
  • Software X runs on/as HW Y
    • CUDA thread <-> CUDA core / SIMD lane
    • CUDA block <-> SM
    • Grid/kernel <-> GPU device
  • One block runs on a single SM; all the threads within one block can only execute on the cores of that SM.
  • kernel<<< gridDim, blockDim >>> — launch configuration: number of blocks, threads per block
    • blockIdx, threadIdx : Index
    • gridDim, blockDim : Dimension (== size)
    • blockDim is the number of threads per block
  • Threads have mechanisms to communicate and synchronize efficiently.
    • The CUDA programming model allows this communication only for threads within the same block
    • Threads in the same block communicate with each other using a special memory: shared memory (see the shared-memory sketch after this list)
  • Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of a kernel.
  • Error checking: cudaError_t e; kernel<<< , >>>(); cudaDeviceSynchronize(); e = cudaGetLastError(); — even if multiple errors occurred, only the last one is returned (see the error-checking sketch after this list)
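A minimal vector-add sketch tying the bullets above together: host vs. device memory, a void kernel, blockIdx/blockDim/threadIdx indexing, the <<<blocks, threads>>> launch, and cudaDeviceSynchronize(). The name vecAdd, the problem size, and the block size are illustrative assumptions, not from the book.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel (device code): each CUDA thread adds one element.
// The global index combines the block index and the thread index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard threads that fall past the end
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host memory
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // <<<number of blocks, threads per block>>>
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // The launch is asynchronous: wait for the device to finish.
    cudaDeviceSynchronize();

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```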
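A sketch of intra-block communication through shared memory with a __syncthreads() barrier. The kernel name blockReverse, the fixed 256-thread block size, and the assumption that n is a multiple of the block size are mine, chosen only to keep the example short.

```cuda
// Each block reverses its own 256-element chunk. The shared-memory tile is
// visible only to threads of the same block; __syncthreads() is a block-wide
// barrier that guarantees all writes are done before any thread reads.
__global__ void blockReverse(float *data, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = data[i];
    __syncthreads();
    if (i < n) data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch example (assumes n is a multiple of 256):
// blockReverse<<<n / 256, 256>>>(d_data, n);
```

Threads in different blocks cannot use this mechanism; cooperation across blocks requires ending the kernel (or other techniques beyond this chapter).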
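A sketch of the error-checking pattern from the last bullet, reusing the hypothetical vecAdd launch from the first sketch:

```cuda
cudaError_t e;

vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);  // asynchronous launch
cudaDeviceSynchronize();                        // wait so execution errors surface
e = cudaGetLastError();                         // only the last error is returned
if (e != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(e));
```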

[[Learn CUDA Programming]]

Date

Oct 19, 2020 4:03 PM

#cuda #book/learn-cuda-programming