CUDA by Example: Chapter 01-05

发表于： 2024-02-06 更新于： 2024-04-08 分类于： cuda-by-example

字数： 1304 阅读：≈ 3分钟

源码可以参考 https://github.com/yottaawesome/cuda-by-example/ ，官网的源码链接挂了。

书中的代码有些需要用 opengl 来跑。安装了 freeglut3-dev 和 mesa-utils。（不确定 libgl1-mesa-dev 是否是必要的。）然后 cmake 规则中要 link 对应的库：

cmake_minimum_required(VERSION 3.20.1)
project(chapter3 LANGUAGES CXX CUDA)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_EXTENSIONS OFF)
add_executable(ray ray_global.cu)
target_link_libraries(ray GL glut)
#                         ^^^^^^^

第 3 章 CUDA 源文件

用 nvcc 编译时不需要为 cuda 内置函数额外包含头文件。这些头文件是在 host 端才需要的。

https://stackoverflow.com/questions/6302695/difference-between-cuda-h-cuda-runtime-h-cuda-runtime-api-h

In very broad terms:
cuda.h defines the public host functions and types for the CUDA driver API.
cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API
cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions.
If you were writing host code to be compiled with the host compiler which includes API calls, you would include either cuda.h or cuda_runtime_api.h. If you needed other CUDA language built-ins, like types, and were using the runtime API and compiling with the host compiler, you would include cuda_runtime.h. If you are writing code which will be compiled using nvcc, it is all irrelevant, because nvcc takes care of inclusion of all the required headers automatically without programmer intervention.

Warning

书中没讲如何编译，需要预备知识。至少要会像 gcc 一样用 nvcc。

Note

没有 compile_commands.json 时 clangd 反而能够为 cu 文件提供诊断？好像是生成的 compile_commands.json 有太多 clang 无法理解的选项，但是 clang 没有忽略它们，导致了诊断问题。

Note

遇到了 wsl 中无法在 cmake 中使用 debug（可以直接 run）的情况，原来是没有安装 C/C++ intellisense。同时也只有安装了这个才能在 .cu 文件中打断点。

不过怎么会没有安装呢？可能是上次 compact 虚拟磁盘之后出现的后遗症。

第 4 章核函数

With kernel<<<256,1>>>(), you would get 256 blocks running on the GPU.

dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
//       ^^^  ^
//  gridDim  blockDim
// =blocks  =threads
//  都是 dim3 类型

// 用 gridDim.{x,y,z} 来访问 blocks 在该方向上的宽度
// 用 blockIdx.{x,y,z} 来访问从 0 开始的下标
// dim 对应维度，idx 对应下标

// 这是不是说明一个 kernel 只能在一个 grid 上跑？

// ...
__global__ void kernel( unsigned char *ptr ) {
  // map from threadIdx/BlockIdx to pixel position
  int x = blockIdx.x;
  int y = blockIdx.y;
  int offset = x + y * gridDim.x;
  // ...
}

第 5 章高维 dim3 和共享内存

每个 block 中的最大线程数是比较小的，在我的 1050 Ti 上是 1024。而 blocks 的数量却可以指定相当多（一般不用担心超限）。所以只用一个 block 完成不了任务时有必要使用多个 blocks。

int tid = threadIdx.x + blockIdx.x * blockDim.x;

创建数量刚刚好的 blocks 和 threads（假设 threadsPerBlock 已经确定为 128）：

add<<< (N+127)/128, 128 >>>( dev_a, dev_b, dev_c );

如果总线程的数量还是不够，就改成循环，每个线程处理一个以上的元素：

__global__ void add( int *a, int *b, int *c ) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  while (tid < N) {
    c[tid] = a[tid] + b[tid];
    tid += blockDim.x * gridDim.x;
  }
}

按照输出的形状对线程分组

在用上 dim3 的第二维时的偏置计算方式如下（以计算图像上 (x, y) 点的像素颜色为例）：

int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
// 线程的形状是 (gridDim.x, gridDim.y, blockDim.x, blockDim.y)
// 计算任务的输出形状是 (gridDim.x * blockDim.x, gridDim.y * gridDim.y)

这样看起来计算任务要处理几维就分配几维，这样比较方便。书中的例子最后巧妙地将 .x 都分给了输出形状的第 0 维，而 .y 是第 1 维。但是维数最多只有 3 维，复杂的计算任务只能用 1 维自己计算映射关系了。

2024 年 2 月 13 日：注意 offset 的计算方式是 y * … 而不是 x *，这是因为 CUDA 中线程的分组方式是 z 在 y 的外围，y 在 x 的外围。

共享内存

用 __shared__ 声明要存储在共享内存上的变量。共享内存允许同一个 block 上的线程的同步（__syncthreads()），快于全局内存。也就是几十 KB 的样子。共享内存是在核函数中对变量声明的。

每个块都有共享内存中变量的私有副本，不同块之间是不能相互访问共享内存的。

Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks.

例子：dot 操作。略

比较经典。

例子：dot 操作的错误优化。略

Note

也有 __syncwarp 函数。

为什么需要 sync warp，warp 中的线程不是按照一个步调执行的吗？这个特性在 CUDA 9 出现，它对于 divergent warps 有用，并且 Volta 架构已经支持了 warp 内的乱序执行。

第 3 章 CUDA 源文件

第 4 章 核函数

第 5 章 高维 dim3 和共享内存

按照输出的形状对线程分组

共享内存

第 4 章核函数

第 5 章高维 dim3 和共享内存