FAQ
Frequently asked questions
Q: The trainer stopped and hasn’t progressed in several minutes.
A: This is usually caused by the GPUs failing to communicate with each other. See the NCCL doc.
Q: Exitcode -9
A: This usually happens when you run out of system RAM.
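As a rough illustration of why -9 points to memory pressure: assuming the launcher reports exit codes using Python's convention (a negative value means the process was terminated by that signal number), -9 corresponds to SIGKILL, which is what the Linux out-of-memory killer sends. The helper name describe_exit_code below is hypothetical and only meant as a sketch for reading such codes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a Python-style process exit code into a readable message.

    Hypothetical helper: assumes negative codes mean "killed by signal N",
    which is the convention used by Python's subprocess and multiprocessing.
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(-9))
```

If the message names SIGKILL and nothing else in the logs explains it, checking system RAM usage during the run is a reasonable next step.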
Q: Exitcode -7 while using deepspeed
A: Try upgrading deepspeed with:
pip install -U deepspeed
Q: AttributeError: ‘DummyOptim’ object has no attribute ‘step’
A: You may be using deepspeed with a single GPU. Do not set the deepspeed: option in the YAML config or on the CLI.
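A minimal sketch of that advice, assuming the config has already been loaded into a plain Python dict: the hypothetical helper drop_deepspeed_for_single_gpu removes the deepspeed key when only one GPU is in use. The key values shown are illustrative placeholders, not settings taken from this FAQ.

```python
def drop_deepspeed_for_single_gpu(cfg: dict, num_gpus: int) -> dict:
    """Return a copy of cfg without the `deepspeed` key on single-GPU runs.

    Hypothetical helper for illustration; it is not part of any library.
    """
    cleaned = dict(cfg)  # shallow copy so the caller's dict is untouched
    if num_gpus <= 1:
        cleaned.pop("deepspeed", None)
    return cleaned


# Illustrative config values (assumptions, not real defaults).
example_cfg = {"base_model": "my-model", "deepspeed": "path/to/ds_config.json"}
single = drop_deepspeed_for_single_gpu(example_cfg, num_gpus=1)
multi = drop_deepspeed_for_single_gpu(example_cfg, num_gpus=4)
print(sorted(single), sorted(multi))
```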
Q: The code is stuck on saving preprocessed datasets.
A: This is usually an issue with the GPU. It can be resolved by setting the OS environment variable CUDA_VISIBLE_DEVICES=0. If you are on RunPod, this is usually a pod issue, and starting a new pod should take care of it.
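One way to apply the environment variable without changing your shell session is to pass it only to the child process you launch. The sketch below uses a stand-in child command that just prints the variable; single_gpu_env is a hypothetical helper, and you would replace the stand-in command with whatever you normally run.

```python
import os
import subprocess
import sys


def single_gpu_env(device: str = "0") -> dict:
    """Copy the current environment and restrict CUDA to one device."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = device
    return env


# Stand-in child process: it only echoes the variable it received.
# Replace this command with your own preprocessing or training command.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=single_gpu_env(),
    capture_output=True,
    text=True,
    check=True,
)
visible = result.stdout.strip()
print(visible)
```

The variable has to be set before the CUDA runtime is initialized in the target process, which is why it is passed through the environment of the child rather than set afterwards.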