
An Obscure RuntimeError for CUDA error: out of memory

Problem

Today when I was running PyTorch scripts, I met a strange problem:

a = torch.rand(2, 2).to('cuda:1')
......
torch.cuda.synchronize()

but it resulted in the following error:

  File "....../test.py", line 67, in <module>
    torch.cuda.synchronize()
  File "....../miniconda3/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 495, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: out of memory

But it’s clear that GPU1 has enough free memory (we only need to allocate 16 bytes!):

|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1A:00.0 Off | N/A |
| 75% 73C P2 303W / 350W | 24222MiB / 24268MiB | 64% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:1B:00.0 Off | N/A |
| 90% 80C P2 328W / 350W | 15838MiB / 24268MiB | 92% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
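As an aside, the 16-byte figure quoted above is easy to verify: `torch.rand(2, 2)` produces a float32 tensor by default, and each float32 element occupies 4 bytes (this quick arithmetic sketch assumes the default dtype has not been changed):

```python
# torch.rand(2, 2) yields a float32 tensor by default:
# 2 x 2 = 4 elements, 4 bytes per float32 element.
num_elements = 2 * 2
bytes_per_element = 4  # size of one float32
print(num_elements * bytes_per_element)  # 16
```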

Normally, when tensor allocation fails, the error message looks like this:

CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 6.00 GiB total capacity; 4.54 GiB already allocated; 14.94 MiB free; 4.64 GiB reserved in total by PyTorch)

But our error message is much “simpler”. So what happened?

Possible Answer

This confused me for some time. According to this website:

When you initially do a CUDA call, it’ll create a cuda context and a THC context on the primary GPU (GPU0), and for that i think it needs 200 MB or so. That’s right at the edge of how much memory you have left.

Sure enough, in my case GPU0 had 24222MiB / 24268MiB of its memory occupied, so there was no room left for the context. This also explains why our error message is RuntimeError: CUDA error: out of memory, rather than the detailed message produced when tensor allocation fails.

Possible Solution

Set the CUDA_VISIBLE_DEVICES environment variable, so that the primary GPU becomes one with free memory instead of GPU0.

Method 1

In the starting python file:

# Do this before `import torch`
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # set to what you like, e.g., '1,2,3,4,5,6,7'

Method 2

In the shell:

# Do this before running python
export CUDA_VISIBLE_DEVICES=1  # set to what you like, e.g., '1,2,3,4,5,6,7'

And then, our program is ready to go.
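One subtlety worth remembering: CUDA_VISIBLE_DEVICES renumbers the visible devices starting from zero, so after setting it to '1', physical GPU1 is addressed as cuda:0 inside PyTorch. A small sketch of the remapping (visible_index is a hypothetical helper for illustration, not a PyTorch API):

```python
def visible_index(physical_gpu: int, cuda_visible_devices: str) -> int:
    """Return the logical index a physical GPU gets under the given
    CUDA_VISIBLE_DEVICES setting, or -1 if the GPU is hidden."""
    visible = [int(d) for d in cuda_visible_devices.split(',')]
    return visible.index(physical_gpu) if physical_gpu in visible else -1

# With CUDA_VISIBLE_DEVICES=1, physical GPU1 becomes cuda:0:
print(visible_index(1, '1'))  # 0
# ...and physical GPU0 is not visible to the program at all:
print(visible_index(0, '1'))  # -1
```

So after either method above, the tensor in our script should be moved to 'cuda:0', not 'cuda:1'.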