CUDA programs are slow on their first run

https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/

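A CUDA executable can carry two kinds of device code: device-independent PTX, and device-specific binary code (SASS). Compiling the PTX into binary code at run time, before the kernel first executes, is the JIT step. (The CUDA driver, not nvcc, caches the JIT result on the file system; on Linux the default location is ~/.nv/ComputeCache. That is why only the first run is slow.) Building with -arch=sm_xx embeds binary code for that specific architecture, so no JIT step is needed when the program runs on a matching GPU.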
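The choice between "load the binary" and "JIT the PTX" can be sketched as a small decision function. This is a simplified model, not the driver's actual code: the name pick_image and the tuple representation of compute capabilities are made up for illustration, and special cases (Tegra parts, architecture-specific targets such as sm_90a) are ignored.

```python
from typing import List, Optional, Tuple

CC = Tuple[int, int]  # compute capability as (major, minor), e.g. (8, 6)

def pick_image(gpu: CC, sass: List[CC], ptx: List[CC]) -> Optional[str]:
    """Simplified model of which embedded image is used on a given GPU."""
    # A binary (SASS) image is usable when its major version matches the
    # GPU's and its minor version is not newer than the GPU's.
    usable_sass = [cc for cc in sass if cc[0] == gpu[0] and cc[1] <= gpu[1]]
    if usable_sass:
        best = max(usable_sass)
        return f"sass sm_{best[0]}{best[1]} (no JIT)"
    # Otherwise fall back to PTX, which can be JIT-compiled for any GPU whose
    # compute capability is at least the PTX's virtual architecture.
    usable_ptx = [cc for cc in ptx if cc <= gpu]
    if usable_ptx:
        best = max(usable_ptx)
        return f"ptx compute_{best[0]}{best[1]} (JIT at startup)"
    return None  # no usable image: the kernel cannot be loaded
```

For example, an executable holding sm_80 binary code plus compute_80 PTX loads the binary directly on an 8.6 GPU, but has to JIT the PTX on a 9.0 GPU.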

Disable the cache (this run will neither read from nor write to it):

export CUDA_CACHE_DISABLE=1

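Then build the program once with -arch=native and once without it, and compare the startup times; the difference is obvious. Without the option, nvcc targets its default architecture, so on a newer GPU the embedded PTX has to be JIT-compiled at every startup while the cache is disabled.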
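To put a number on the difference, one option is to time each binary from a small script. The sketch below is only a rough measurement: it times the whole process, not just the JIT step, and the helper name time_startup is hypothetical.

```python
import os
import subprocess
import sys
import time

def time_startup(cmd, disable_cache=True):
    """Run cmd once and return its wall-clock time in seconds."""
    env = dict(os.environ)
    if disable_cache:
        # Same effect as `export CUDA_CACHE_DISABLE=1` in the shell.
        env["CUDA_CACHE_DISABLE"] = "1"
    start = time.perf_counter()
    # check=True raises if the program exits with a non-zero status.
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start
```

Calling time_startup(["./app_native"]) and time_startup(["./app_default"]) (both binary names are placeholders for the two builds) gives the two numbers to compare. Running each a few times helps average out noise.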

nvcc --help:

nvcc embeds a compiled code image in the resulting executable for each specified <code> architecture, which is a true binary load image for each ‘real’ architecture (such as sm_50), and PTX code for the ‘virtual’ architecture (such as compute_50). During runtime, such embedded PTX code is dynamically compiled by the CUDA runtime system if no binary load image is found for the ‘current’ GPU.

Note

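Strictly speaking, -arch (--gpu-architecture) names the virtual architecture the PTX is generated for (compute_XX), while -code (--gpu-code) names what gets embedded in the executable: sm_XX for a real binary image, compute_XX for PTX. The shorthand -arch=sm_XX is equivalent to -arch=compute_XX -code=sm_XX,compute_XX, so it embeds both the binary and the PTX.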
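When several GPUs have to be covered, nvcc's -gencode option takes one arch=...,code=... pair per embedded image. A minimal sketch of building such a flag list is below; the helper gencode_flags is hypothetical, and the architecture numbers are just examples.

```python
def gencode_flags(sass_archs, ptx_arch=None):
    """Build nvcc -gencode arguments for the given compute capabilities.

    sass_archs: capabilities (e.g. "80") to embed as real binary images.
    ptx_arch:   optional capability to embed as PTX for JIT on newer GPUs.
    """
    flags = []
    for cc in sass_archs:
        # Virtual arch compute_XX compiled down to a real sm_XX binary.
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    if ptx_arch is not None:
        # code=compute_XX keeps the PTX itself in the executable.
        flags += ["-gencode", f"arch=compute_{ptx_arch},code=compute_{ptx_arch}"]
    return flags
```

For instance, gencode_flags(["80", "86"], ptx_arch="86") yields flags that avoid JIT on 8.0 and 8.6 GPUs while still leaving PTX for anything newer.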