Hello, 

I am seeing strange behavior with slurm-2.6.0 + blcr-0.8.5, and I am 
wondering whether this is expected or whether anyone has advice to offer.

When I submit a job (a batch script with one srun_cr step running a simple 
single-threaded, dynamically linked program), then checkpoint and stop it with 
scontrol checkpoint vacate, I can restart it without problems only if it is 
allocated to the same node it previously ran on. 
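
For reference, the sequence above can be sketched roughly as follows. The file 
name job.sh, the program ./my_program, and the helper name checkpoint_cycle are 
placeholders for illustration, not the actual names used. The Slurm commands 
need a cluster with the BLCR checkpoint plugin, so they are only wrapped in a 
function here and not run.

```shell
# Sketch only: job.sh and ./my_program are placeholder names.
# Write a batch script that contains a single srun_cr step.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
srun_cr ./my_program
EOF

# Hypothetical helper: checkpoint-and-stop a running job, then restart it.
# Usage: checkpoint_cycle <jobid>
checkpoint_cycle() {
    jobid="$1"
    # Write a checkpoint image and terminate the job.
    scontrol checkpoint vacate "$jobid"
    # Restart from the image; Slurm may pick a different node.
    scontrol checkpoint restart "$jobid"
}
```

The job would be submitted with "sbatch job.sh", and checkpoint_cycle would be 
called with the job ID that sbatch prints once the job is running.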

If the job is restarted on another node, it starts and then immediately stops 
without any error message, either in stdout/stderr or in the log. As far as 
Slurm is concerned, everything went fine, even though the program did not 
complete its execution. 

If I use the BLCR commands (cr_run and cr_restart) directly inside the script, 
everything works fine whichever node it restarts on. 

Has anyone faced the same problem? 

Thanks in advance, 

damien
