1
Week 6: 『GPU(CUDA) Programming』
[CSE 4152] Advanced Software Lab I, Week 6: 『GPU(CUDA) Programming』 (Tue) 안재풍 ( AS907,
2
CUDA Memory Architecture
Warp
The unit of threads that run simultaneously on a single SM.
A warp is 32 consecutive threads.
Memory accesses and arithmetic are carried out together, one warp at a time.
[Figure: SMs, each with its own shared memory; thread block (8x32)]
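The grouping of consecutive threads into warps can be sketched on the host in plain C++. This is only an illustration, not GPU code: it assumes the 8x32 block in the figure is 32 threads wide (x) and 8 threads high (y), and that threads are linearized with x varying fastest, as CUDA does. The names `linear_tid`, `warp_pos`, and `warps_per_block` are hypothetical helpers.

```cpp
// Host-side model of how a thread block is split into warps.
// Assumption: block is 32 (x) by 8 (y); thread IDs are linearized x-fastest.

constexpr int kWarpSize = 32;  // threads per warp, as stated on the slide

struct WarpPos {
    int warp;  // index of the warp inside the block
    int lane;  // position of the thread inside its warp (0..31)
};

// Linear thread ID inside a 2D block.
int linear_tid(int tx, int ty, int block_dim_x) {
    return ty * block_dim_x + tx;
}

// Consecutive groups of 32 linear thread IDs form one warp.
WarpPos warp_pos(int tid) {
    return WarpPos{tid / kWarpSize, tid % kWarpSize};
}

// Number of warps needed to cover a block (rounded up).
int warps_per_block(int threads_in_block) {
    return (threads_in_block + kWarpSize - 1) / kWarpSize;
}
```

Under these assumptions `warps_per_block(8 * 32)` is 8, and each row of the block (fixed `ty`) is exactly one warp.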
3
CUDA Memory Architecture (Global Memory)
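Global memory access pattern
Code executes one warp at a time, so the next instruction can be issued only after the memory accesses of every thread in the warp have completed.
Global memory access, the most frequently used kind, can take up to 16 times the optimal access time depending on the access pattern.
A single global memory access fetches one contiguous 128-byte unit.
Access is most efficient when the 32 threads of a warp touch a contiguous memory region.
Access is inefficient when the 32 threads of a warp touch non-contiguous memory regions.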
4
CUDA Memory Architecture (Global Memory)
[Figure: three access patterns of one warp (threads 0 to 31) drawn against memory addresses 128 and 256]
1 x 128-byte memory transaction, at 128
2 x 128-byte memory transactions, at 128 and 256
16 x 128-byte memory transactions
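The transaction counts in the figure can be reproduced with a small host-side model. The sketch below is a simplification, not a description of any specific GPU: it assumes each lane of a warp reads one 4-byte float, and that the cost of a request is the number of distinct 128-byte segments the 32 addresses fall into. `count_segments` and `warp_float_addrs` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::size_t kSegmentBytes = 128;  // size of one memory transaction
constexpr int kWarpSize = 32;               // threads per warp

// Count the distinct 128-byte segments touched by a set of byte addresses.
std::size_t count_segments(const std::vector<std::uint64_t>& byte_addrs) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : byte_addrs) {
        segments.insert(addr / kSegmentBytes);
    }
    return segments.size();
}

// Byte addresses read by one warp when lane i loads the float at
// base + i * stride_elems * sizeof(float).
std::vector<std::uint64_t> warp_float_addrs(std::uint64_t base,
                                            std::size_t stride_elems) {
    std::vector<std::uint64_t> addrs;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        addrs.push_back(base + lane * stride_elems * sizeof(float));
    }
    return addrs;
}
```

In this model, `count_segments(warp_float_addrs(128, 1))` gives 1 (contiguous and aligned), `count_segments(warp_float_addrs(132, 1))` gives 2 (contiguous but shifted by one float, so the warp straddles the segments at 128 and 256), and `count_segments(warp_float_addrs(128, 16))` gives 16 (a stride of 16 floats puts every second lane in a new segment).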
5
CUDA Memory Architecture (Global Memory)
Array of structures vs. structure of arrays
CPU: array of structures (higher cache hit ratio)
GPU: structure of arrays (fewer global memory transactions)
Array of structures layout: float x, float y, float z, float d | float x, float y, float z, float d | float x, float y, float z, float d | ...
Structure of arrays layout: float x0, float x1, float x2, float x3, ... | float y0, float y1, float y2, float y3, ... | float z0, float z1, float z2, float z3, ... | float d0, float d1, float d2, float d3, ...
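To see why the structure-of-arrays layout needs fewer transactions, the host-side sketch below compares where field `x` of elements 0 to 31 lands in each layout. It assumes both layouts start at a 128-byte-aligned address and that lane i of a warp reads `x` of element i. The type and function names (`PointAoS`, `PointsSoA`, `segments_for_warp`, and so on) are hypothetical, and the array length 1024 is arbitrary.

```cpp
#include <cstddef>
#include <set>

// Array of structures: one element holds all four fields.
struct PointAoS { float x, y, z, d; };

// Structure of arrays: each field is stored in its own contiguous array.
constexpr int kN = 1024;  // arbitrary element count for the illustration
struct PointsSoA { float x[kN], y[kN], z[kN], d[kN]; };

constexpr std::size_t kSegmentBytes = 128;  // one memory transaction
constexpr int kWarpSize = 32;               // threads per warp

// Byte offset of field x of element i in each layout.
std::size_t aos_x_offset(std::size_t i) {
    return i * sizeof(PointAoS) + offsetof(PointAoS, x);
}
std::size_t soa_x_offset(std::size_t i) {
    return offsetof(PointsSoA, x) + i * sizeof(float);
}

// Distinct 128-byte segments touched when lane i reads x of element i.
template <typename OffsetFn>
std::size_t segments_for_warp(OffsetFn offset_of_x) {
    std::set<std::size_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        segments.insert(offset_of_x(static_cast<std::size_t>(lane)) / kSegmentBytes);
    }
    return segments.size();
}
```

With 16-byte elements, `segments_for_warp(aos_x_offset)` is 4 because the 32 `x` values are spread over 512 bytes, while `segments_for_warp(soa_x_offset)` is 1 because they occupy 128 contiguous bytes.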
6
Shared Memory
Shared memory is on-chip memory, so access to it is markedly faster than access to global memory.
Shared memory is organized differently from global memory; when no bank conflicts occur, it performs about as well as registers.
7
Shared Memory : Usage
8
Shared Memory : Bank Conflict
Shared memory has 16 banks (32 banks on compute capability 2.x).
Shared memory is divided into equally sized memory modules, called banks, which can be accessed simultaneously.
If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized.
Shared memory features a broadcast mechanism whereby a 32-bit word can be read and broadcast to several threads simultaneously when servicing one memory read request.
[Figure: shared memory drawn as a row of 4-byte words, assigned to the banks in turn]
9
Shared Memory : Bank Conflict
Some examples
[Figure: threads mapped onto 4-byte shared memory banks. One access pattern gives a 4-way bank conflict; the other is conflict free.]
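The exact patterns in the figure are not recoverable, but strided access is one common way to get both outcomes. The host-side sketch below assumes successive 4-byte words are assigned to successive banks, that thread t reads word `t * stride`, and that 16 threads are served per request on a 16-bank device. Every thread reads a different word here, so the broadcast mechanism plays no role. `bank_of_word` and `conflict_degree_for_stride` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Bank holding a given 4-byte word: successive words go to successive banks.
int bank_of_word(int word_index, int num_banks) {
    return word_index % num_banks;
}

// Largest number of threads that land in the same bank when thread t
// reads word t * stride. 1 means conflict free, k means a k-way conflict.
int conflict_degree_for_stride(int num_threads, int stride, int num_banks) {
    std::vector<int> hits(num_banks, 0);
    for (int t = 0; t < num_threads; ++t) {
        ++hits[bank_of_word(t * stride, num_banks)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model `conflict_degree_for_stride(16, 1, 16)` is 1 (conflict free) and `conflict_degree_for_stride(16, 4, 16)` is 4 (a 4-way conflict, since only banks 0, 4, 8, and 12 are used). An odd stride such as 3 is also conflict free because it shares no factor with the bank count.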
10
Shared Memory : Bank Conflict
For devices of compute capability 2.x, multiple words can be broadcast in a single transaction (for devices of compute capability 1.x, only a single word can be broadcast in a single transaction).
[Figure: threads mapped onto 4-byte shared memory banks. Compute capability 1.x: 4-way bank conflict. Compute capability 2.x: no bank conflict.]
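The compute capability 2.x rule can be modeled on the host as follows: threads that read the same word are served by one broadcast, so only different words in the same bank serialize. The sketch assumes 32 banks, successive 4-byte words in successive banks, and a hypothetical pattern in which groups of 4 threads share a word. It does not model the 1.x serialization procedure. All names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Conflict degree under the multiple-word broadcast rule: count the
// DISTINCT words requested in each bank and return the largest count.
int conflict_degree_with_broadcast(const std::vector<int>& word_per_thread,
                                   int num_banks) {
    std::map<int, std::set<int>> words_in_bank;
    for (int word : word_per_thread) {
        words_in_bank[word % num_banks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& entry : words_in_bank) {
        worst = std::max(worst, entry.second.size());
    }
    return static_cast<int>(worst);
}

// Pattern in which each word is read by threads_per_word consecutive threads.
std::vector<int> shared_word_pattern(int num_threads, int threads_per_word) {
    std::vector<int> words;
    for (int t = 0; t < num_threads; ++t) {
        words.push_back(t / threads_per_word);
    }
    return words;
}
```

For `shared_word_pattern(32, 4)` with 32 banks the result is 1: eight words in eight different banks, each broadcast to four threads. If threads instead request words 0 and 32, which share bank 0, the result is 2.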