CUDA

Notes on CUDA fundamentals for interviews.

CUDA Interview

1. Hardware Concepts

SP (CUDA core: ALU + FPU), plus SFU (special function units) and LD/ST (load/store) units
SM (multiple SPs + warp schedulers + register file + on-chip memory)
The GPU hides memory latency by switching among many resident threads; switching is cheap because each thread's resources (registers, shared memory) are already allocated on the SM.

1.1 Memory hierarchy

  1. register file (on-chip, private to a thread)
    1. if a kernel uses more registers than are available, values spill to local memory
  2. shared memory (on-chip, visible to a block)
    1. the shared memory size is configurable; the remaining on-chip storage serves as L1 cache
  3. constant memory (useful only in specific scenarios, e.g. small read-only data broadcast to all threads)
  4. texture memory (useful only in specific scenarios, e.g. data with 2D spatial locality)
  5. local memory (off-chip, private to a thread) / global memory (off-chip, visible to all threads)
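The memory spaces above can be sketched in a single kernel; the names `scaleRows` and `cScale` are illustrative, not from the source:

```cuda
__constant__ float cScale;                 // constant memory: read-only, cached, broadcast to all threads

__global__ void scaleRows(const float *g_in, float *g_out, int n) {
    __shared__ float tile[256];            // shared memory: on-chip, visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r;                               // register: on-chip, private to this thread
    if (i < n) {
        tile[threadIdx.x] = g_in[i];       // global memory: off-chip, visible to all threads
        __syncthreads();
        r = tile[threadIdx.x] * cScale;
        g_out[i] = r;                      // write back to global memory
    }
}
```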

1.2 Compute capability

The compute capability is a version number that identifies the set of hardware features a GPU card supports.
Reference: https://en.wikipedia.org/wiki/CUDA

1.3 GPU architecture generations:

  • Tesla
  • Fermi
  • Kepler
  • Maxwell
  • Pascal
  • Volta
  • Turing

2. Software Concepts

  • a thread is the basic execution unit
  • a block is scheduled onto an SM (__syncthreads() synchronizes the threads within a block)
  • a grid maps to the device
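The thread/block/grid mapping can be sketched as a minimal vector-add kernel (the kernel name and sizes are illustrative):

```cuda
// Each thread computes one element; the grid of blocks covers the whole array.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

// Launch: a grid of blocks runs on the device; each block runs on one SM.
// vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
```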

2.1 SIMT

CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data.

2.2 Branch divergence

When threads within the same warp take different branch directions, the warp executes each path in turn with the non-participating threads masked off, so divergent branches are serialized.
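A sketch of warp divergence (kernel names are illustrative): an even/odd branch splits every warp, while branching at warp granularity keeps each warp uniform.

```cuda
// Divergent: even and odd lanes of the same warp take different paths,
// so the warp executes both paths serially with inactive lanes masked off.
__global__ void divergent(float *out) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

// Uniform: all 32 threads of a warp take the same path -> no divergence.
__global__ void uniform(float *out) {
    int i = threadIdx.x;
    if ((i / warpSize) % 2 == 0)   // whole warp shares one branch decision
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}
```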

3. Compute Performance (the roofline model)

  • slope: bandwidth (byte/s)
  • x-axis: arithmetic intensity (flop/byte)
  • y-axis: compute performance (flop/s)
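These three quantities define the roofline model: attainable performance is bounded by min(peak flop/s, bandwidth × arithmetic intensity). A minimal host-side sketch; the hardware numbers below are hypothetical, for illustration only:

```cuda
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_flops = 10e12;   // 10 TFLOP/s peak compute (hypothetical)
    const double bandwidth  = 500e9;   // 500 GB/s memory bandwidth (hypothetical)
    const double intensity  = 4.0;     // kernel does 4 flops per byte moved

    // Roofline bound: below the ridge point the kernel is memory-bound.
    double attainable = std::min(peak_flops, bandwidth * intensity);
    printf("attainable: %.1f GFLOP/s\n", attainable / 1e9);  // prints 2000.0: memory-bound
    return 0;
}
```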


4. CUDA Performance Analysis

4.1 Bank conflicts (shared memory)

Shared memory is divided into banks; each bank is 4 bytes wide (configurable to 8 bytes on some architectures) and can service only one access per clock cycle. If multiple threads of a warp access different addresses that fall in the same bank, a bank conflict occurs and the accesses are serialized.

(Figure: shared memory and L1 cache in the NVIDIA Kepler GPU architecture)

Reference: https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/

Special case: if all threads of the warp access the same address, the value is broadcast and no conflict occurs.
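A sketch of how access stride determines conflicts (kernel name is illustrative), assuming 32 banks of 4-byte words:

```cuda
// One warp (32 threads) accessing a shared array with different strides.
__global__ void bankConflictDemo(float *out) {
    __shared__ float tile[32 * 32];
    int t = threadIdx.x;            // lanes 0..31 of one warp
    tile[t] = (float)t;             // stride-1 write: one address per bank, no conflict
    __syncthreads();
    float bad  = tile[t * 32];      // stride-32 read: every lane maps to bank 0 -> 32-way conflict
    float good = tile[t];           // stride-1 read: no conflict
    out[t] = bad + good;
}
```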

4.2 Memory coalescing (global memory)

memory coalescing (since compute capability 1.x / CUDA 2.0)

When the threads of a warp access adjacent global memory addresses, the hardware coalesces the accesses into fewer memory transactions; e.g. threads 0, 1, 2 accessing elements 0, 1, 2 of an array.
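The contrast can be sketched with two copy kernels (names are illustrative):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 accesses are served by a minimal number of memory transactions.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];      // thread 0 -> in[0], thread 1 -> in[1], ...
}

// Strided: consecutive threads touch addresses far apart, so each warp
// needs many separate transactions -> much lower effective bandwidth.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```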

4.3 Warp occupancy

warp occupancy = active warps per SM / maximum warps per SM
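The CUDA runtime can report how many blocks of a given kernel fit on one SM, from which occupancy follows; a sketch (the kernel is a hypothetical placeholder):

```cuda
#include <cstdio>

__global__ void dummyKernel(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;       // 256 threads = 8 warps per block
    int numBlocks = 0;         // max resident blocks per SM for this kernel
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummyKernel,
                                                  blockSize, 0);

    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```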

4.4 CUDA Streams

Devices whose deviceOverlap property is set (cudaDeviceProp::deviceOverlap) can use CUDA streams to overlap data transfers with kernel execution.
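A sketch of copy/compute overlap with two streams (kernel name and sizes are illustrative); async copies require pinned host memory:

```cuda
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, half = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory, required for cudaMemcpyAsync
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    // Split the work in half: the copy in one stream can overlap
    // the kernel running in the other stream.
    for (int i = 0; i < 2; ++i) {
        int off = i * half;
        cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```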