Problem setting

After installing a new version of PyTorch, you will want to verify that the installation works. Typically, the following code is enough:

import torch

print(f"torch version: {torch.__version__}")

# Check whether PyTorch can see a CUDA-capable GPU.
use_cuda = torch.cuda.is_available()
if use_cuda:
    GPU_nums = torch.cuda.device_count()
    GPU = torch.cuda.get_device_properties(0)
    print(f"There are {GPU_nums} GPUs in total.\nThe first GPU is: {GPU}")
    print(f"CUDA version: {torch.version.cuda}")
    # Report the cuDNN version that PyTorch sees at runtime.
    print(f"cudnn version: {torch.backends.cudnn.version()}")

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda:0" if use_cuda else "cpu")
print(f"Using {device} now!")

Sometimes CUDA is available but cuDNN is still misconfigured. This may cause an error like the one below:

RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (8, 7, 0) but found runtime version (8, 1, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. Looks like your LD_LIBRARY_PATH contains incompatible version of cudnn. Please either remove it from the path or install cudnn (8, 7, 0)

Locating the problem

Following the error message, let’s print LD_LIBRARY_PATH. The output may look like this:

bash: /mnt/lustre/share/cuda-11.8/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64/:...

Let’s check each of them. First is the CUDA path:

$ ls /mnt/lustre/share/cuda-11.8/lib64
cmake  libcublas_static.a  libcublasLt_static.a  libcudadevrt.a
libcudart_static.a  libcufft_static_nocallback.a  libcufftw_static.a  libcufile_rdma_static.a
libcufile_static.a  libcufilt.a  libculibos.a  libcupti_static.a
libcurand_static.a  libcusolver_lapack_static.a  libcusolver_static.a  libcusparse_static.a
liblapack_static.a  libmetis_static.a  libnppc_static.a  libnppial_static.a
libnppicc_static.a  libnppidei_static.a  libnppif_static.a  libnppig_static.a
libnppim_static.a  libnppist_static.a  libnppisu_static.a  libnppitc_static.a
libnpps_static.a  libnvjpeg_static.a  libnvperf_host_static.a  libnvptxcompiler_static.a
libnvrtc-builtins_static.a  libnvrtc_static.a  stubs

There is no cuDNN here at all. Checking the remaining paths, we finally locate the incompatible version of cuDNN:

$ ls /usr/local/cuda/lib64
libcublasLt_static.a  libcudadevrt.a  libcudart_static.a  libcudnn_static.a
libcudnn_static_v8.a  libcufft_static.a  libcufft_static_nocallback.a  libcufftw_static.a
libculibos.a  libcurand_static.a  libcusolver_static.a  libcusparse_static.a
liblapack_static.a  libmetis_static.a  libnppc_static.a  libnppial_static.a
libnppicc_static.a  libnppidei_static.a  libnppif_static.a  libnppig_static.a
libnppim_static.a  libnppist_static.a  libnppisu_static.a  libnppitc_static.a
libnpps_static.a  libnvjpeg_static.a  libnvptxcompiler_static.a  nvrtc-prev
stubs
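
Checking every directory by hand gets tedious when LD_LIBRARY_PATH is long. The search can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper name find_cudnn_dirs and the libcudnn* file pattern are assumptions made here, not part of PyTorch or CUDA.

```python
import glob
import os


def find_cudnn_dirs(ld_library_path):
    """Return (directory, [file names]) pairs for entries that contain libcudnn* files.

    Hypothetical helper: entries are reported in the order they appear in the
    path string, which is the order in which they would be searched.
    """
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing ':'
        pattern = os.path.join(glob.escape(entry), "libcudnn*")
        matches = sorted(glob.glob(pattern))
        if matches:
            hits.append((entry, [os.path.basename(m) for m in matches]))
    return hits


if __name__ == "__main__":
    for directory, names in find_cudnn_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(directory)
        for name in names:
            print(f"  {name}")
```

Any directory this prints that is not PyTorch’s own lib directory is a candidate for the incompatible cuDNN.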

Solving the problem

Again, according to the error message, PyTorch is installed with the correct version of cuDNN bundled. To “ensure PyTorch can find the bundled cuDNN”, we need to find the bundled lib and add it to LD_LIBRARY_PATH.

If you are using conda, the bundled lib can be found at site-packages/torch/lib under your environment directory:

$ ls [/path/to/your/conda]/envs/[env-name]/lib/python3.9/site-packages/torch/lib                      
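
If you would rather not type the environment path by hand, the same directory can be derived from the location of the installed package. The sketch below assumes the torch/lib layout described above. The helper name bundled_lib_dir is made up for this example. It uses importlib to locate the package without importing it, so it works even when importing torch fails.

```python
import importlib.util
import os


def bundled_lib_dir(package="torch"):
    """Return <package directory>/lib, or None if the package cannot be located.

    Hypothetical helper: it only builds the path and does not check that
    the directory exists or what it contains.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points at <package directory>/__init__.py.
    return os.path.join(os.path.dirname(spec.origin), "lib")


if __name__ == "__main__":
    print(bundled_lib_dir())
```

Run it with the Python interpreter of the environment in which PyTorch is installed, then pass the printed path to ls as shown above.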

The last thing is to modify the environment variables, i.e., add the following lines to your ~/.bashrc:

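# cuDNN
export MY_TORCH_LIB="[/path/to/your/conda]/envs/[env-name]/lib/python3.9/site-packages/torch/lib"
# Prepend, so that the bundled cuDNN is found first
export LD_LIBRARY_PATH=$MY_TORCH_LIB:$LD_LIBRARY_PATH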

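Don’t use export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MY_TORCH_LIB, because that appends $MY_TORCH_LIB to the end of the search path, where the incompatible cuDNN found earlier would still be picked up first. We want $MY_TORCH_LIB to have a higher priority, so it must come first.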
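
To make the ordering concrete, here is a small sketch of the prepend operation on a colon-separated path string. The function prepend_dir is a hypothetical helper written for this post. It only builds the new string and does not modify the environment.

```python
import os


def prepend_dir(current, new_dir):
    """Return a search path with new_dir first, dropping empty entries and repeats of new_dir."""
    rest = [p for p in current.split(os.pathsep) if p and p != new_dir]
    return os.pathsep.join([new_dir] + rest)


if __name__ == "__main__":
    # "/path/to/torch/lib" is a placeholder, not a real location.
    print(prepend_dir(os.environ.get("LD_LIBRARY_PATH", ""), "/path/to/torch/lib"))
```

Because directories are searched from left to right, the first entry wins when two directories hold different versions of the same library. The variable is normally read when a process starts, so set it in the shell before launching Python rather than from inside a running script.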

Finally, don’t forget to run source ~/.bashrc so that the new LD_LIBRARY_PATH takes effect in the current shell.