Problem setting

When you just install a new version of PyTorch, you’d like to verify if you have installed it correctly. Typically, the following codes will be enough:

import torch
print(f"torch version: {torch.__version__}")
use_cuda = torch.cuda.is_available()
if use_cuda:
    GPU_nums = torch.cuda.device_count()
    GPU = torch.cuda.get_device_properties(0)
    print(f"There are {GPU_nums} GPUs in total.\nThe first GPU is: {GPU}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f'cudnn version: {torch.backends.cudnn.version()}')
device = torch.device(f"cuda:0" if use_cuda else "cpu")
print(f"Using {device} now!")

Sometimes although CUDA is available, there is still something wrong with cuDNN. This maybe induce error like below:

RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (8, 7, 0) but found runtime version (8, 1, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. Looks like your LD_LIBRARY_PATH contains incompatible version of cudnn. Please either remove it from the path or install cudnn (8, 7, 0)

Problem locating

According to the error message, let’s print the LD_LIBRARY_PATH and the return may appears like below:

$ $LD_LIBRARY_PATH
bash: /mnt/lustre/share/cuda-11.8/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64/:...

Let’s check each of them. First is the CUDA path:

$ ls /mnt/lustre/share/cuda-11.8/lib64
cmake                         libcufftw.so             libcurand.so.10.3.0.86       libnppial_static.a       libnppist.so.11.8.0.86   libnvperf_target.so
libaccinj64.so                libcufftw.so.10          libcurand_static.a           libnppicc.so             libnppist_static.a       libnvptxcompiler_static.a
libaccinj64.so.11.8           libcufftw.so.10.9.0.58   libcusolver_lapack_static.a  libnppicc.so.11          libnppisu.so             libnvrtc-builtins.so
libaccinj64.so.11.8.87        libcufftw_static.a       libcusolverMg.so             libnppicc.so.11.8.0.86   libnppisu.so.11          libnvrtc-builtins.so.11.8
libcheckpoint.so              libcufile_rdma.so        libcusolverMg.so.11          libnppicc_static.a       libnppisu.so.11.8.0.86   libnvrtc-builtins.so.11.8.89
libcublasLt.so                libcufile_rdma.so.1      libcusolverMg.so.11.4.1.48   libnppidei.so            libnppisu_static.a       libnvrtc-builtins_static.a
libcublasLt.so.11             libcufile_rdma.so.1.4.0  libcusolver.so               libnppidei.so.11         libnppitc.so             libnvrtc.so
libcublasLt.so.11.11.3.6      libcufile_rdma_static.a  libcusolver.so.11            libnppidei.so.11.8.0.86  libnppitc.so.11          libnvrtc.so.11.2
libcublasLt_static.a          libcufile.so             libcusolver.so.11.4.1.48     libnppidei_static.a      libnppitc.so.11.8.0.86   libnvrtc.so.11.8.89
libcublas.so                  libcufile.so.0           libcusolver_static.a         libnppif.so              libnppitc_static.a       libnvrtc_static.a
libcublas.so.11               libcufile.so.1.4.0       libcusparse.so               libnppif.so.11           libnpps.so               libnvToolsExt.so
libcublas.so.11.11.3.6        libcufile_static.a       libcusparse.so.11            libnppif.so.11.8.0.86    libnpps.so.11            libnvToolsExt.so.1
libcublas_static.a            libcufilt.a              libcusparse.so.11.7.5.86     libnppif_static.a        libnpps.so.11.8.0.86     libnvToolsExt.so.1.0.0
libcudadevrt.a                libcuinj64.so            libcusparse_static.a         libnppig.so              libnpps_static.a         libOpenCL.so
libcudart.so                  libcuinj64.so.11.8       liblapack_static.a           libnppig.so.11           libnvblas.so             libOpenCL.so.1
libcudart.so.11.0             libcuinj64.so.11.8.87    libmetis_static.a            libnppig.so.11.8.0.86    libnvblas.so.11          libOpenCL.so.1.0
libcudart.so.11.8.89          libculibos.a             libnppc.so                   libnppig_static.a        libnvblas.so.11.11.3.6   libOpenCL.so.1.0.0
libcudart_static.a            libcupti.so              libnppc.so.11                libnppim.so              libnvjpeg.so             libpcsamplingutil.so
libcufft.so                   libcupti.so.11.8         libnppc.so.11.8.0.86         libnppim.so.11           libnvjpeg.so.11          stubs
libcufft.so.10                libcupti.so.2022.3.0     libnppc_static.a             libnppim.so.11.8.0.86    libnvjpeg.so.11.9.0.86
libcufft.so.10.9.0.58         libcupti_static.a        libnppial.so                 libnppim_static.a        libnvjpeg_static.a
libcufft_static.a             libcurand.so             libnppial.so.11              libnppist.so             libnvperf_host.so
libcufft_static_nocallback.a  libcurand.so.10          libnppial.so.11.8.0.86       libnppist.so.11          libnvperf_host_static.a

and find there is no cuDNN at all. Look into the remain paths and finally we locate the imcompatible version of cuDNN:

$ ls /usr/local/cuda/lib64
libaccinj64.so               libcudnn_cnn_infer.so.8       libcufftw.so.10.4.1.152      libnppc.so                libnppig_static.a        libnvblas.so.11.4.1.1043
libaccinj64.so.11.2          libcudnn_cnn_infer.so.8.1.1   libcufftw_static.a           libnppc.so.11             libnppim.so              libnvjpeg.so
libaccinj64.so.11.2.152      libcudnn_cnn_train.so         libcuinj64.so                libnppc.so.11.3.2.152     libnppim.so.11           libnvjpeg.so.11
libcublasLt.so               libcudnn_cnn_train.so.8       libcuinj64.so.11.2           libnppc_static.a          libnppim.so.11.3.2.152   libnvjpeg.so.11.4.0.152
libcublasLt.so.11            libcudnn_cnn_train.so.8.1.1   libcuinj64.so.11.2.152       libnppial.so              libnppim_static.a        libnvjpeg_static.a
libcublasLt.so.11.4.1.1043   libcudnn_ops_infer.so         libculibos.a                 libnppial.so.11           libnppist.so             libnvptxcompiler_static.a
libcublasLt_static.a         libcudnn_ops_infer.so.8       libcurand.so                 libnppial.so.11.3.2.152   libnppist.so.11          libnvrtc-builtins.so
libcublas.so                 libcudnn_ops_infer.so.8.1.1   libcurand.so.10              libnppial_static.a        libnppist.so.11.3.2.152  libnvrtc-builtins.so.11.2
libcublas.so.11              libcudnn_ops_train.so         libcurand.so.10.2.3.152      libnppicc.so              libnppist_static.a       libnvrtc-builtins.so.11.2.152
libcublas.so.11.4.1.1043     libcudnn_ops_train.so.8       libcurand_static.a           libnppicc.so.11           libnppisu.so             libnvrtc.so
libcublas_static.a           libcudnn_ops_train.so.8.1.1   libcusolverMg.so             libnppicc.so.11.3.2.152   libnppisu.so.11          libnvrtc.so.11.2
libcudadevrt.a               libcudnn.so                   libcusolverMg.so.11          libnppicc_static.a        libnppisu.so.11.3.2.152  libnvrtc.so.11.2.152
libcudart.so                 libcudnn.so.8                 libcusolverMg.so.11.1.0.152  libnppidei.so             libnppisu_static.a       libnvToolsExt.so
libcudart.so.11.0            libcudnn.so.8.1.1             libcusolver.so               libnppidei.so.11          libnppitc.so             libnvToolsExt.so.1
libcudart.so.11.2.152        libcudnn_static.a             libcusolver.so.11            libnppidei.so.11.3.2.152  libnppitc.so.11          libnvToolsExt.so.1.0.0
libcudart_static.a           libcudnn_static_v8.a          libcusolver.so.11.1.0.152    libnppidei_static.a       libnppitc.so.11.3.2.152  libOpenCL.so
libcudnn_adv_infer.so        libcufft.so                   libcusolver_static.a         libnppif.so               libnppitc_static.a       libOpenCL.so.1
libcudnn_adv_infer.so.8      libcufft.so.10                libcusparse.so               libnppif.so.11            libnpps.so               libOpenCL.so.1.0
libcudnn_adv_infer.so.8.1.1  libcufft.so.10.4.1.152        libcusparse.so.11            libnppif.so.11.3.2.152    libnpps.so.11            libOpenCL.so.1.0.0
libcudnn_adv_train.so        libcufft_static.a             libcusparse.so.11.4.1.1152   libnppif_static.a         libnpps.so.11.3.2.152    nvrtc-prev
libcudnn_adv_train.so.8      libcufft_static_nocallback.a  libcusparse_static.a         libnppig.so               libnpps_static.a         stubs
libcudnn_adv_train.so.8.1.1  libcufftw.so                  liblapack_static.a           libnppig.so.11            libnvblas.so
libcudnn_cnn_infer.so        libcufftw.so.10               libmetis_static.a            libnppig.so.11.3.2.152    libnvblas.so.11

Problem solving

Again, according to the error message, PyTorch is installed with the correct version of cuDNN bundled. To “ensure PyTorch can find the bundled cuDNN”, we need to find the bundled lib and add it to LD_LIBRARY_PATH.

If you are using conda, the bundled lib can be found at site-packages/torch/lib under your environment directory:

$ ls [/path/to/your/conda]/envs/[env-name]/lib/python3.9/site-packages/torch/lib
libc10_cuda.so      libcublas.so.11             libcudnn_cnn_infer.so.8  libcudnn.so.8              libnvrtc-builtins.so.11.8    libtorch_cuda_linalg.so  libtorch.so
libc10.so           libcudart-d0da41ae.so.11.0  libcudnn_cnn_train.so.8  libgomp-a34b3233.so.1      libnvToolsExt-847d78f2.so.1  libtorch_cuda.so
libcaffe2_nvrtc.so  libcudnn_adv_infer.so.8     libcudnn_ops_infer.so.8  libnvfuser_codegen.so      libshm.so                    libtorch_global_deps.so
libcublasLt.so.11   libcudnn_adv_train.so.8     libcudnn_ops_train.so.8  libnvrtc-672ee683.so.11.2  libtorch_cpu.so              libtorch_python.so

The last thing is to modify the environment variables, i.e., add the following lines to your ~/.bashrc:

# cuDNN
export MY_TORCH_LIB="[/path/to/your/conda]/envs/[env-name]/lib/python3.9/site-packages/torch/lib"
export LD_LIBRARY_PATH=$MY_TORCH_LIB:$LD_LIBRARY_PATH

Don’t use export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MY_TORCH_LIB becasue we want $MY_TORCH_LIB to have a higher priority.

Finally, don’t forget to call source ~/.bashrc, which adds the new LD_LIBRARY_PATH to the environment variables.