Chapter 2. CUDA Memory Management
- Most of an application's performance is bottlenecked by memory-related constraints
- GPU DRAM bandwidth: ~900 GB/s on a Tesla V100 (HBM2)
- NVIDIA Visual Profiler (nvvp) helps identify these memory bottlenecks
- Global memory is a staging area where all of the data gets copied from CPU memory.
- Global memory (device memory) is visible to all of the threads in the kernel and is also visible to the CPU.
- Coalesced vs. uncoalesced global memory access
- Coalesced global memory access: adjacent threads in a warp access adjacent memory addresses, so the hardware combines them into a minimal number of transactions (see the sketch below)
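A minimal sketch of the two access patterns (kernel names are illustrative): the first kernel's warps load consecutive addresses and coalesce, while the second's strided loads scatter across transactions.

```cuda
// Coalesced: thread k in a warp reads element k, so 32 loads fall in
// consecutive addresses and combine into a few memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: consecutive threads read addresses `stride` elements apart,
// so each load can land in a separate transaction.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```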
Warp
- A warp is the unit of thread scheduling/execution in an SM. Once a block has been assigned to an SM, it is divided into 32-thread units known as warps
- Among all of the available warps, the ones with operands that are ready for the next instruction become eligible for execution.
- All of the threads in a warp execute the same instruction when selected.
AOS vs. SOA
- AOS : Array of Structures.
A[0].a, A[1].a, ...
- SOA : Structure of Arrays. Each member of the structure is an array
S.a[0], S.a[1], ...
- SOA is suitable for SIMT: the same operation is applied to the same member with different array indices. The threads of the same block then access adjacent memory locations, which increases spatial locality (see the sketch after this list).
- As the GPU is a latency-hiding architecture, it is important to saturate the memory bandwidth.
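A sketch contrasting the two layouts with a hypothetical particle type (the struct names and the scaleX kernel are illustrative):

```cuda
// AoS: members of one element are interleaved in memory.
struct ParticleAoS { float x, y, z; };       // p[i].x and p[i+1].x are 12 bytes apart

// SoA: one contiguous array per member.
struct ParticlesSoA { float *x, *y, *z; };   // x[i] and x[i+1] are adjacent

// With SoA, thread i and its warp neighbors read x[i], x[i+1], ... :
// adjacent addresses, hence coalesced. The same kernel over an AoS layout
// would issue loads 12 bytes apart.
__global__ void scaleX(ParticlesSoA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p.x[i] *= s;
}
```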
Shared Memory
- User-Managed Cache
- Shared memory is only visible to the threads in the same block.
- All of the threads in a block see the same version of shared memory.
- Threads in other SMs cannot see this shared memory
- Another block has its own shared memory
- Even in the same SM, different blocks have their own separate shared memory
- The key use of shared memory comes from the fact that threads within a block can share memory accesses
- A variable placed in shared memory can be loaded once and then accessed multiple times
- CUDA 9.0 introduces cooperative groups, which allow communication between threads in different SMs (i.e., across blocks)
- Bank
- Shared memory is organized into banks to achieve higher bandwidth.
- Each bank can serve one address per cycle.
- A Volta GPU has 32 banks, each 4 bytes wide, so 128 bytes can be served in one cycle
- Bank Conflict
- If multiple threads access different addresses in the same bank, the accesses to shared memory are serialized. This should be avoided if possible (see the reduction sketch below, which uses a conflict-free access pattern).
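A minimal block-level sum reduction as a sketch of both ideas, assuming a block size of 256 (a power of two); the sequential-addressing loop keeps consecutive threads on consecutive banks, avoiding conflicts:

```cuda
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];               // visible only to this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Stage the block's chunk in shared memory once; later accesses stay on-chip.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Sequential addressing: active threads read tile[tid] and tile[tid + s].
    // Consecutive threads hit consecutive banks, so accesses are conflict-free
    // (unlike interleaved patterns such as tile[2 * s * tid]).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];            // one partial sum per block
}
```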
Read-only Cache
- Referred to as texture cache
const <type> * __restrict__ pointer qualifiers (or explicit __ldg() loads) route reads through this cache
- Ideally used when the entire warp reads the same address/data (see the sketch below)
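A sketch of the qualifier in use (the axpy kernel is illustrative): marking x as const ... __restrict__ tells the compiler the data is read-only and not aliased, making it eligible for the read-only cache, while __ldg() requests that path explicitly.

```cuda
__global__ void axpy(float a,
                     const float * __restrict__ x,   // eligible for the read-only cache
                     float * __restrict__ y,
                     int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldg(&x[i]) + y[i];              // __ldg forces a read-only load
}
```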
Registers
- Scope : a single thread. Each thread has its own registers
- Local variables are stored in the registers
- Too many local variables can cause performance issues, as data that does not fit in the register file must reside in L1/L2 cache or device memory
- These are register spills (see the sketch below for bounding and inspecting register usage)
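A sketch of bounding register usage with an illustrative kernel: __launch_bounds__ tells the compiler to support 256 threads per block with at least 4 resident blocks per SM, which caps registers per thread (and may itself force spills if the cap is too tight). Compiling with nvcc -Xptxas -v reports registers used and spill counts.

```cuda
// The compiler must fit 4 blocks of 256 threads on an SM, which bounds
// the number of registers it may assign to each thread of this kernel.
__global__ void __launch_bounds__(256, 4) scaleAdd(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}
```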
Pinned Memory
- Recommendations to reduce host/device memory copy overhead:
- Minimize amount of data to be transferred
- Use the pinned memory
- Batch small transfers into one large transfer
- Asynchronous data transfer
malloc
allocates pageable memory.
- Devices such as the GPU cannot directly access pageable memory.
- When the device needs to access pageable memory, the driver allocates a temporary pinned buffer, copies the data from the pageable memory into it, and then performs the DMA
- This introduces additional latency
cudaMallocHost
allocates pinned (page-locked) memory from system memory.
- Excessive use of pinned memory hurts overall system performance, since non-pageable memory is also needed by the OS
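A minimal sketch putting the recommendations together: a pinned host buffer plus cudaMemcpyAsync on a stream, so the transfer can be a true DMA that overlaps with other work (sizes and names are illustrative).

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h_pinned, *d_buf;

    cudaMallocHost((void **)&h_pinned, n * sizeof(float));  // pinned host memory
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory the copy is a direct DMA: no staging buffer, and the
    // async variant lets the transfer overlap with independent work.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```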
Unified Memory
- Provides a single memory space accessible from both the CPU and the GPU, which makes programming and porting CPU applications to the GPU easier
- Allows over-subscription of GPU memory
- A single pointer is used by both the CPU and the GPU; in the non-unified case, each side needs its own pointer because host memory and device memory are separate (see the sketch below)
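A minimal sketch of the single-pointer model (kernel and sizes are illustrative): the same pointer from cudaMallocManaged is written by the CPU, used by the GPU kernel, and read back by the CPU.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    const int n = 256;
    int *data;                                  // one pointer for both sides
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; i++) data[i] = i;    // CPU writes through it
    increment<<<1, n>>>(data, n);               // GPU uses the same pointer
    cudaDeviceSynchronize();
    printf("data[0] = %d\n", data[0]);          // CPU reads the result back

    cudaFree(data);
    return 0;
}
```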
cudaMallocManaged
does not allocate physical memory when it is called; pages are allocated only when the data is touched for the first time. When the GPU later accesses pages that were first touched on the CPU, this requires page migration and introduces additional time.
- Workaround 1: define an initialization kernel on the GPU that touches the unified memory space on behalf of the workload kernel
- Workaround 2: Prefetch
Initialization kernel
- I don't quite understand why the number of page faults decreases... And if multiple threads access the same page at the same time, does that count as a single page fault?? The example is hard to follow...
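A sketch of the workaround (kernel names and sizes are illustrative). As I understand it, the fault count drops because first touch now happens on the GPU, so pages are created device-side instead of being migrated from the host, and concurrent faults by many threads on the same page are serviced as a single migration.

```cuda
#include <cuda_runtime.h>

// First touch happens here, on the GPU, so pages are created device-side.
__global__ void initKernel(float *data, int n, float val) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = val;
}

__global__ void workloadKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;      // pages are already resident on the GPU
}

int main() {
    const int n = 1 << 20;
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));

    int block = 256, grid = (n + block - 1) / block;
    initKernel<<<grid, block>>>(buf, n, 1.0f);   // touch on GPU instead of CPU
    workloadKernel<<<grid, block>>>(buf, n);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```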
Prefetch
cudaMemPrefetchAsync
cudaMemAdviseSetReadMostly
cudaMemAdviseSetPreferredLocation
cudaMemAdviseSetAccessedBy
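A sketch combining the advice flags with a prefetch (buffer size and hint choices are illustrative); the prefetch migrates pages to the GPU up front instead of faulting them in one at a time.

```cuda
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    const size_t n = 1 << 20;
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));

    // Hints: the data is mostly read, should stay on this GPU,
    // and is also mapped by the CPU.
    cudaMemAdvise(buf, n * sizeof(float), cudaMemAdviseSetReadMostly, device);
    cudaMemAdvise(buf, n * sizeof(float), cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(buf, n * sizeof(float), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    // Migrate the pages to the GPU before any kernel touches them.
    cudaMemPrefetchAsync(buf, n * sizeof(float), device, 0);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```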
Volta combines the shared memory and L1 cache
- Total 128KB
- Shared memory can be allocated up to 96KB, but allocations above 48KB require an explicit opt-in (see the sketch below)
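A sketch of the opt-in with an illustrative kernel: cudaFuncSetAttribute raises the dynamic shared memory limit for that kernel, and the size is then passed as the third launch parameter.

```cuda
#include <cuda_runtime.h>

__global__ void bigTileKernel(float *data) {
    extern __shared__ float tile[];   // size supplied at launch time
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x];
}

int main() {
    const int smemBytes = 96 * 1024;  // beyond the default 48 KB limit

    // Opt in: kernels must explicitly request > 48 KB of dynamic shared memory.
    cudaFuncSetAttribute(bigTileKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, smemBytes);

    float *d_data;
    cudaMalloc((void **)&d_data, 1024 * sizeof(float));
    bigTileKernel<<<1, 1024, smemBytes>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```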
#cuda #book/learn-cuda-programming