Problem
Today, while running a PyTorch script, I hit a strange problem:
```python
a = torch.rand(2, 2).to('cuda:1')
```
but it resulted in the following error:
```
File "....../test.py", line 67, in <module>
    a = torch.rand(2, 2).to('cuda:1')
RuntimeError: CUDA error: out of memory
```
But it's clear that GPU 1 has enough memory (we only need to allocate 16 bytes!):
[nvidia-smi output, abridged: GPU 0 is nearly full (24222MiB / 24268MiB used), while GPU 1 has most of its memory free]
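As a quick sanity check on the "16 bytes" claim: a 2×2 float32 tensor is 4 elements of 4 bytes each. A minimal sketch (CPU only, so no CUDA context is involved):

```python
import torch

a = torch.rand(2, 2)                    # float32 by default
print(a.element_size() * a.nelement())  # 4 bytes/element * 4 elements = 16
```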
And normally, when PyTorch fails to allocate memory for a tensor, the error looks like this:
```
CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 6.00 GiB total capacity; 4.54 GiB already allocated; 14.94 MiB free; 4.64 GiB reserved in total by PyTorch)
```
But our error message is much “simpler”. So what happened?
Possible Answer
This confused me for some time. According to this website:
> When you initially do a CUDA call, it'll create a cuda context and a THC context on the primary GPU (GPU0), and for that i think it needs 200 MB or so. That's right at the edge of how much memory you have left.
Surprisingly, in my case, GPU 0 was already occupied: 24222MiB / 24268MiB used. So there was no memory left for the context. This also explains why our error message is `RuntimeError: CUDA error: out of memory` rather than the message that says a tensor allocation failed: the failure happens while creating the context, before the 16-byte allocation is even attempted.
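One way to confirm this theory is to read each GPU's memory usage without creating a CUDA context at all. Here is a minimal sketch using NVIDIA's NVML Python bindings (assumes `pip install nvidia-ml-py`; this is an illustration, not part of the original debugging session):

```python
# Query per-GPU memory via NVML; unlike torch.cuda calls, this does not
# create a CUDA context, so it cannot itself fail with "out of memory".
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU{i}: {mem.used / 2**20:.0f}MiB / {mem.total / 2**20:.0f}MiB used")
pynvml.nvmlShutdown()
```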
Possible Solution
Set the CUDA_VISIBLE_DEVICES environment variable: we need to change the primary GPU from the full GPU 0 to another one.
Method 1
At the top of the entry Python file:
```python
# Do this before `import torch`
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
```
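After this remapping, the process only sees the physical GPU 1, which now appears as `cuda:0`, so the context is created there instead of on the full GPU 0. A usage sketch (the device index `'1'` is an assumption based on the two-GPU setup above):

```python
import torch

# 'cuda:0' now maps to the physical GPU 1
a = torch.rand(2, 2).to('cuda:0')
print(torch.cuda.get_device_name(0))
```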
Method 2
In the shell:
```bash
# Do this before running python
export CUDA_VISIBLE_DEVICES=1
```
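Alternatively, the variable can be set for a single run only (standard shell behavior; `test.py` stands in for your script):

```bash
CUDA_VISIBLE_DEVICES=1 python test.py
```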
And then, our program is ready to go.