So there are different values you can set for ReturnToService in slurm.conf that affect how a node is handled when it reconnects. You can also raise the timeouts for the daemons.
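Roughly something like this in slurm.conf (these are the standard slurm.conf parameter names, but the values below are only illustrative; check the slurm.conf man page for your 17.x release before changing anything):

    ReturnToService=2    # a DOWN node becomes available again as soon as it
                         # registers with a valid configuration, regardless of
                         # why it was marked DOWN
    SlurmdTimeout=600    # seconds slurmctld waits for slurmd to respond before
                         # marking the node DOWN (default is 300)
    MessageTimeout=30    # seconds allowed for a round-trip RPC (default is 10),
                         # which helps ride out brief control-network hiccups

With ReturnToService=2 the nodes should come back on their own once the network recovers, instead of staying DOWN until someone resumes them by hand.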

-Paul Edmon-


On 8/31/2018 5:06 PM, Renfro, Michael wrote:
Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing, 
if it matters) with both gigabit Ethernet and Infiniband interfaces. Twice in 
the last year, I’ve had a failure inside the stacked Ethernet switches that’s 
caused Slurm to lose track of node and job state. Jobs kept running as normal, 
since all file traffic is on the Infiniband network.

In both cases, I wasn’t able to cleanly recover. On the first outage, my 
attempt at recovery (pretty sure I forcibly drained and resumed the nodes) 
caused all active jobs to be killed, and then the next group of queued jobs to 
start. On the second outage, all active jobs were restarted from scratch, 
including truncating and overwriting any existing output. I think that involved 
my restarting slurmd or slurmctld services, but I’m not certain.

I’ve built a VM test environment with OpenHPC and Slurm 17.11 to simulate these 
kinds of failures, but haven’t been able to reproduce my earlier results. After a 
sufficiently long network outage, I get downed nodes with “Reason=Duplicate
jobid”.

Basically, I’d like to know what the proper procedure is for recovering from 
this kind of outage in the Slurm control network without losing the output from 
running jobs. Not sure if I can easily add any redundancy in the Ethernet 
network, but I may be able to add in the Infiniband network for control if 
that’s supported. Thanks.


