Hello, I am seeing strange behavior with slurm-2.6.0 + blcr-0.8.5, and I was wondering whether this is normal or whether anyone has advice to offer.
When I submit a job (a batch script with one srun_cr step running a simple single-threaded, dynamically linked program), then checkpoint and stop it with scontrol checkpoint vacate, I can restart it without problems only if it is allocated to the same node it previously ran on. If the job is restarted on another node, it starts and then immediately stops, with no error message in stdout/stderr or in the log. As far as Slurm is concerned, everything went fine, even though the program did not complete its execution.

If I use the BLCR commands directly inside the script (cr_run and cr_restart), everything works fine whichever node the job restarts on.

Has anyone faced the same problem?

Thanks in advance,
damien
