I'm afraid that, for now, you will need to either drain the entire node or configure the bad GPU out of that node's configuration, then restart slurmd on that node and restart slurmctld.
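Roughly, something like the following sketch; the node name gpu-node01 and the device paths are just placeholders, so adjust them to your setup:

    # Option 1: drain the whole node so no further jobs are scheduled on it
    scontrol update NodeName=gpu-node01 State=DRAIN Reason="bad GPU"

    # Option 2: configure the bad GPU out of the node.
    # In gres.conf on that node, list only the healthy devices, e.g.:
    #   Name=gpu File=/dev/nvidia0
    #   Name=gpu File=/dev/nvidia1
    #   Name=gpu File=/dev/nvidia3
    # and reduce the node's GPU count in slurm.conf to match (e.g. Gres=gpu:3),
    # then restart slurmd on that node and slurmctld on the controller.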
Quoting Sa Li <[email protected]>:

> Hi, slurm fans
>
> I have been running a cluster containing a lot of GPU machines, and Slurm
> manages to throw jobs onto each GPU. Things seem to be OK, except when one
> of the GPUs fails for some reason: Slurm is not able to detect that, and as
> a result jobs are constantly sent to the bad GPU, which kills them all.
>
> If you look at squeue, all the jobs can end up stopped and eaten by that
> bad GPU instead of being sent to the other GPUs, since the GPUs receive
> jobs in order of the incremental device name. If gpu2 fails, then gpu3
> could just slack off there receiving nothing. I don't know if there is a
> fault tolerance mechanism to handle that kind of node/GPU failure.
>
> thanks
>
> --
> Sa Li
> Senior Research Developer
> www.pof.com
