
CUDA No process found but GPU memory occupied

Problem

When typing nvidia-smi, we find that GPU memory is occupied (see the red boxes), but we cannot see any relevant process on that GPU (see the orange boxes).

[Figure: nvidia-smi output — red boxes mark the occupied GPU memory, orange boxes mark the empty process list]

Possible Answer

This can be caused by torch.distributed and other multi-process CUDA programs: when the main process terminates, its background worker processes may stay alive (they are never killed) and keep holding GPU memory.
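For example, if a torch.distributed launcher is killed with kill -9, only the launcher dies and the spawned workers keep running. As a minimal sketch of a cleaner shutdown (the launcher PID 14700 below is purely hypothetical), signal the whole process group instead of a single PID:

# Look up the process group ID of the launcher (hypothetical PID 14700)
PGID=$(ps -o pgid= -p 14700 | tr -d ' ')
# Signal every process in that group so the CUDA workers exit together
kill -TERM -- "-$PGID"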

  1. To figure out which processes are using the GPU, run the following command:
fuser -v /dev/nvidia<id>
# OUTPUT
                     USER     PID    ACCESS  COMMAND
/dev/nvidia5:        XXXXXX   14701  F...m   python
                     XXXXXX   14703  F...m   python
                     XXXXXX   14705  F...m   python
                     XXXXXX   14706  F...m   python
                     XXXXXX   37041  F...m   python
                     XXXXXX   37053  F...m   python

This will list all of the processes that use the GPU. Note that when executed as a normal user, only that user's processes are displayed; when executed as root, every user's relevant processes are displayed.
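If fuser is unavailable on the machine, the same information can often be cross-checked with nvidia-smi's query mode or with lsof (a sketch, assuming the /dev/nvidia5 device from the example above; processes inside other containers or namespaces may still be hidden):

# Ask the driver directly which compute processes hold GPU memory
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# Or list open handles on the device file (run as root to see all users)
lsof /dev/nvidia5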

  2. Then use the following command to kill the processes shown above:
kill -9 [PID]

That will kill the process on the GPU. After the processes are killed, you will find that the GPU memory is freed. If it is still occupied, the leftover processes may belong to other users; you need to ask those users or an administrator to kill them manually.
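When many worker PIDs are left over, killing them one by one is tedious. As a sketch (again assuming the /dev/nvidia5 device from the example, and that the leftover processes are your own; otherwise the kill will be denied), fuser itself can signal every process that still holds the device:

# -k sends SIGKILL by default to every process accessing the device
fuser -v -k /dev/nvidia5
# Equivalent explicit loop over the PIDs reported by fuser
for pid in $(fuser /dev/nvidia5 2>/dev/null); do kill -9 "$pid"; done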