Actually, our cluster is very homogeneous too (same CentOS 6.4 image on all nodes). blcr works well independently of Slurm, and it also works well when I use cr_run and cr_restart inside a Slurm submission script.
It is only when using srun_cr and scontrol checkpoint restart that I encounter this problem.

damien

On 10 Oct 2013, at 15:50, Michael Gutteridge wrote:

> Does this issue apply to your environment:
>
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
>
> We've been running blcr in test for some time. We haven't encountered this
> issue on our platform (Ubuntu 12.04.01), but our systems are extremely
> homogeneous.
>
> Hope this helps
>
> M
>
>
> On Thu, Oct 10, 2013 at 12:18 AM, Damien François
> <[email protected]> wrote:
>
> Hello,
>
> I am seeing strange behavior when using slurm-2.6.0+blcr-0.8.5 and I was
> wondering whether this is normal or if anyone has any advice to offer.
>
> When I submit a job (batch script with one srun_cr step - monothreaded,
> dynamically-linked simple program), then checkpoint it and stop it with
> scontrol checkpoint vacate, I am able to restart it with no problem only if
> it is reallocated to the same node it previously ran on.
>
> If the job is restarted on another node, it starts and then immediately stops
> without any error message, either in stdout/stderr or in the log. To Slurm,
> it seems everything went OK even though the program did not complete its
> execution.
>
> If I use the blcr commands inside the script (cr_run and cr_restart),
> everything works fine whichever node it restarts on.
>
> Did anyone face the same problem?
>
> Thanks in advance,
>
> damien
>
>
>
> --
> Hey! Somebody punched the foley guy!
> - Crow, MST3K ep. 508
