[slurm-dev] Re: Strange behavior with slurm+blcr

Damien François Fri, 11 Oct 2013 00:07:21 -0700

Actually our cluster is very homogeneous too (same CentOS 6.4 image for all 
nodes), and blcr works well independently of Slurm, and also when I use 
cr_run and cr_restart inside a Slurm submission script.


It is only when using srun_cr and scontrol checkpoint restart that I encounter 
this problem.
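
On the prelink point, for anyone who wants to verify it on their own nodes: as far as I understand, prelink rewrites shared libraries on disk, so nodes built from the same image can still end up with different library files. Below is a minimal sketch of one way to check this, assuming Python 3.9+ and ldd are available on the nodes. The function names are made up for illustration and are not part of Slurm or blcr.

```python
import hashlib
import subprocess
from pathlib import Path


def parse_ldd(output: str) -> list[str]:
    """Extract the absolute library paths from the text printed by ldd."""
    paths = []
    for line in output.splitlines():
        # Typical lines look like
        #   libc.so.6 => /lib64/libc.so.6 (0x...)
        #   /lib64/ld-linux-x86-64.so.2 (0x...)
        # Entries without a path on disk (e.g. linux-vdso) are skipped.
        for token in line.split():
            if token.startswith("/"):
                paths.append(token)
                break
    return paths


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def library_manifest(binary: str) -> dict[str, str]:
    """Map each shared library reported by ldd for `binary` to its checksum."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return {path: sha256_of(path) for path in parse_ldd(result.stdout)}


def differing_libraries(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """List libraries that are missing on one side or differ in checksum."""
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Running library_manifest on the checkpointed program on each node (for instance from a small job step per node), saving the results, and passing the two dictionaries to differing_libraries gives the list of libraries that are not byte-identical. An empty list makes the prelink explanation less likely, though it does not prove anything about the srun_cr path itself.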

damien


On 10 Oct 2013, at 15:50, Michael Gutteridge wrote:

> Does this issue apply to your environment:
> 
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
> 
> We've been running blcr in test for some time.  We haven't encountered this 
> issue on our platform (Ubuntu 12.04.01), but our systems are extremely 
> homogeneous.
> 
> Hope this helps
> 
> M
> 
> 
> On Thu, Oct 10, 2013 at 12:18 AM, Damien François 
> <[email protected]> wrote:
> 
> Hello,
> 
> I am seeing strange behavior when using slurm-2.6.0+blcr-0.8.5 and was 
> wondering whether this is normal or whether anyone has any advice to offer.
> 
> When I submit a job (a batch script with one srun_cr step running a simple, 
> single-threaded, dynamically linked program), then checkpoint and stop it 
> with scontrol checkpoint vacate, I can restart it without problems only if 
> it is allocated to the same node it previously ran on.
> 
> If the job is restarted on another node, it starts and then immediately stops 
> without any error message, either in stdout/stderr or in the log. To Slurm 
> it seems everything went OK, even though the program did not complete its 
> execution.
> 
> If I use the blcr commands inside the script (cr_run and cr_restart), 
> everything works fine whichever node it restarts on.
> 
> Has anyone faced the same problem?
> 
> Thanks in advance,
> 
> damien
> 
> 
> 
> -- 
> Hey! Somebody punched the foley guy!
>    - Crow, MST3K ep. 508
>
