CUDA by Example: Chapter 06-08

发表于： 2024-02-06 更新于： 2024-04-08 分类于： cuda-by-example

字数： 1815 阅读：≈ 4分钟

第 6 章 Constant Memory and Events

常量内存

常量内存是在全局区域声明的。如果漏掉了 __constant__ 关键字，就会将其定义在全局内存区域，尽管存储方式、分配的时机和用 cudaMalloc 申请的内存有一些差异。

__constant__ Sphere s[SPHERES];

常量内存的内存拷贝方法比较特殊：

HANDLE_ERROR( cudaMemcpyToSymbol( s, temp_s,
                                  sizeof(Sphere) * SPHERES) );

CUDA 线程对常量内存是只读的，也就是只有 host 能操作常量内存。通过将反复读取的数据移动到常量内存区域而不是全局内存，可以加速。但是要注意常量内存的大小非常有限（）。书中的例子只是对 20 个球体做光线追踪。

常量内存为什么快：

有 constant cache。
对常量内存的读取可以被广播到临近的线程。节省读取次数。

什么叫做临近线程？GPU 中线程执行是按照 warp 分组的。如果同一组的都需要同一个 constant memory 中的数据，那么 GPU 只会产生一个访问请求，这样会很节省带宽。（需要全局内存时难道不能用这个方法节省带宽吗？）

In the CUDA Architecture, a warp refers to a collection of 32 threads that are “woven together” and get executed in lockstep. At every line in your program, each thread in a warp executes the same instruction on different data.
When it comes to handling constant memory, NVIDIA hardware can broadcast a single memory read to each half-warp. A half-warp—not nearly as creatively named as a warp—is a group of 16 threads: half of a 32-thread warp. If every thread in a half-warp requests data from the same address in constant memory, your GPU will generate only a single read request and subsequently broadcast the data to every thread.

常量内存广播策略也可能使得性能降低，比如在半 warp 中每个线程访问不同内存地址的时候，请求不仅不能合并，还只能序列化发出：

Unfortunately, there can potentially be a downside to performance when using constant memory. The half-warp broadcast feature is in actuality a double-edged sword. Although it can dramatically accelerate performance when all 16 threads are reading the same address, it actually slows performance to a crawl when all 16 threads read different addresses.
For example, if all 16 threads in a half-warp need different data from constant memory, the 16 different reads get serialized, effectively taking 16 times the amount of time to place the request. If they were reading from conventional global memory, the request could be issued at the same time. In this case, reading from constant memory would probably be slower than using global memory.

CUDA 事件

cudaEvent_t     start, stop;
HANDLE_ERROR( cudaEventCreate( &start ) );
HANDLE_ERROR( cudaEventCreate( &stop ) );
HANDLE_ERROR( cudaEventRecord( start, 0 ) );

// ...
HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( stop ) );
float   elapsedTime;
HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                    start, stop ) );
printf( "Time to generate:  %3.1f ms\n", elapsedTime );

HANDLE_ERROR( cudaEventDestroy( start ) );
HANDLE_ERROR( cudaEventDestroy( stop ) );

cudaEventSynchronize 非常重要，因为 host 和 device 之间是异步执行的。

cudaEvent 发生在 device 上，只适合记录 device 上的执行（也就是 GPU 任务），用它来计算 CPU 上的时间是不正确的。

测试结果：用了常量内存之后，在 Debug 模式下，书中给的代码执行时间从 645.1ms 和 3.3ms 降低到了 466ms 和 2.7ms。（较大的那个时间可能是 GPU 调度的时间，也可能是 Windows 上双显卡设备中独显从不活跃状态转换成活跃状态的时间。不清楚原因。）没有书里面降低的那么夸张，但是还是有很大的降低。

第 7 章 Texture Memory

纹理内存也是只读和有缓存的。纹理内存要和全局内存绑定，所以说纹理内存对应的位置并不是完全不可以被修改，只是透过纹理内存的访问机制无法被修改而已。

Like constant memory, texture memory is cached on chip, so in some situations it will provide higher effective bandwidth by reducing memory requests to off-chip DRAM.

纹理内存对空间中临近地址的访问有很好的局部性支持。对于一般的 CPU 来说，这样的访问很不利于 cache，但是纹理内存的缓存机制不同。因而当传统的 cache 机制对于要解决的问题不友好时，可以考虑 texture memory。

书上纹理内存相关的代码不能编译，据说是 CUDA 12 的 API 有了变化。

NVIDIA removed support for texture references in CUDA 12.0. NVIDIA told CUDA programmers that they should switch to texture objects (as they planned to remove texture references) for five years prior to that.

https://stackoverflow.com/a/67197817/

书中的实现方式（在 CUDA 11.3 废弃）

一维纹理内存

首先创建 texture。

// 1D 纹理引用
texture<float> texConstSrc;
texture<float> texIn;
texture<float> texOut;

然后将 texture 用 cudaBindTexture 绑定到已经申请的全局内存上。

HANDLE_ERROR( cudaBindTexture( NULL, texConstSrc,
                               data.dev_constSrc,
                               imageSize ) );

And since texture references must be declared globally at file scope, we can no longer pass the input and output buffers as parameters to blend_kernel() because the compiler needs to know at compile time which textures tex1Dfetch() should be sampling.

这里的 blend_kernel 是书中要实现的一个函数，而 tex1Dfetch 是 CUDA 的 compiler intrinsic，用来读取纹理内存。

书中用到了 texIn 和 texOut 两个缓冲区，用来实现 2D 热交换的模拟。但是因为 tex1Dfetch 的纹理引用参数需要在编译时确定（offset 参数可以动态确定），所以使用了一个 bool 标志，以近似于复制粘贴的方式写了两个分支：

结束的时候要用 cudaUnbindTexture 来解除纹理内存的绑定。

二维纹理内存

声明的时候多了一个维度参数。

texture<float,2> texConstSrc;
texture<float,2> texIn;
texture<float,2> texOut;

读取则使用 tex2D，和 tex1DFetch 很类似，但是接受两个下标参数（这样就不用自己计算 offset）。

绑定的方式也发生了变化，增加了 cudaChannelFormatDesc 和 2D 的长度（之前是 1D，所以只需要给出总长度）：

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
HANDLE_ERROR( cudaBindTexture2D( NULL, texConstSrc,
                                 data.dev_constSrc,
                                 desc, DIM, DIM,
                                 sizeof(float) * DIM ) );

最后同样是要接触纹理内存的绑定。

根据书中所述，1D 纹理内存和 2D 纹理内存的性能相同。所以应该根据代码编写的方便性来选择。

CUDA 12 之后的实现

参考 https://developer.nvidia.com/blog/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/

Kepler GPUs and CUDA 5.0 introduce a new feature called texture objects (sometimes called bindless textures, since they don’t require manual binding/unbinding) that greatly improves the usability and programmability of textures.

纹理对象（cudaTextureObject_t）不需要在全局区域声明，不需要绑定和解绑，也不需要在编译时知道要操作的纹理对象，因而更加灵活。

访问的时候还是用同样的 tex1Dfetch 和 tex2D 函数模板，但是是另外一个重载版本。

TODO（操作起来要复杂一点，待补充）

第 8 章 CUDA 和图形编程库的可互操作性

很多 API 都过时了。主要讲的是 OpenGL，DirectX 也稍微提了一下。