CUDA programs are slow on their first run

Published: 2024-02-12 Updated: 2024-08-18

https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/

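A CUDA executable can carry two kinds of device code: PTX, an intermediate representation that is not tied to a specific GPU, and binary code compiled for one specific architecture. If the executable contains no binary for the GPU it runs on, the CUDA driver compiles the embedded PTX into binary code at startup; this is the JIT step. (The driver, not nvcc, caches the JIT result on the file system, so normally only the first run pays this cost.) Compiling with -arch=sm_xx for the architecture of the target GPU embeds matching binary code, so no JIT step is needed at run time.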
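To see which kinds of code actually ended up in an executable, cuobjdump can list the embedded images. The following is a minimal sketch: the file name app.cu and the architecture sm_80 are placeholders chosen for illustration, not taken from the original post.

```shell
# Hypothetical source file and target architecture; substitute your own.
SRC=app.cu
ARCH=sm_80

if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # Build for one real architecture.
    nvcc -arch="$ARCH" "$SRC" -o app

    # List the embedded binary (ELF) images, one per real architecture.
    cuobjdump --list-elf app

    # List the embedded PTX images, which the driver can JIT on other GPUs.
    cuobjdump --list-ptx app
else
    echo "nvcc or $SRC not found; skipping the build"
fi
```

If the PTX list is non-empty but no ELF image matches the GPU in the machine, the program should go through the JIT step at startup.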

Disable the cache (this run may neither read from nor write to it):

export CUDA_CACHE_DISABLE=1

Then build the program once with -arch=native and once without it; the difference in startup time is clearly noticeable.
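The comparison can be sketched as follows. The file name app.cu and the output names are hypothetical, and whether the default build triggers JIT depends on the nvcc version and on the GPU in the machine, so treat this as an experiment outline rather than a guaranteed result.

```shell
# Hypothetical source file; substitute your own .cu file.
SRC=app.cu

# Bypass the JIT cache so that every run pays the full JIT cost.
export CUDA_CACHE_DISABLE=1

if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # Default build: targets nvcc's default architecture, which may not match this GPU.
    nvcc "$SRC" -o app_default

    # Native build: embeds binary code for the GPU(s) detected on this machine.
    nvcc -arch=native "$SRC" -o app_native

    # Compare wall-clock time; JIT cost shows up as extra startup time.
    time ./app_default
    time ./app_native
else
    echo "nvcc or $SRC not found; skipping the build"
fi
```

Running the same two binaries again after unsetting CUDA_CACHE_DISABLE should show the default build speeding up from its second run on, once the cache has been populated.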

nvcc --help:

nvcc embeds a compiled code image in the resulting executable for each specified <code> architecture, which is a true binary load image for each ‘real’ architecture (such as sm_50), and PTX code for the ‘virtual’ architecture (such as compute_50). During runtime, such embedded PTX code is dynamically compiled by the CUDA runtime system if no binary load image is found for the ‘current’ GPU.

Note

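-arch (--gpu-architecture) selects the virtual architecture (compute_xx) that PTX is generated for, while -code (--gpu-code) selects what is embedded in the executable: binary code for each real architecture (sm_xx) and, if a virtual architecture is listed, its PTX. Passing a real architecture to -arch is shorthand: -arch=sm_xx means -arch=compute_xx -code=sm_xx,compute_xx.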