Thanks, all. This patch fixed the errors we were having.


On Mon, Jun 3, 2013 at 8:36 AM, Moe Jette <[email protected]> wrote:

>
> Thank you for the patch. This will be in versions 2.5.7 and 2.6.0-rc1
> (release candidate 1), both to be released this week.
>
> Quoting Hongjia Cao <[email protected]>:
>
> > This is a problem with the checkpointing code in SLURM. The attached
> > patch fixes it.
> >
> > On Wed, 2013-05-29 at 12:50 -0700, Michael Gutteridge wrote:
> >> We're having some trouble getting our Slurm jobs to successfully
> >> restart after a checkpoint.  For this test, I'm using sbatch and a
> >> simple, single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
> >>
> >> I'm submitting the job using sbatch:
> >>
> >>
> >> $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
> >>
> >>
> >>
> >> I am able to create the checkpoint and vacate the node:
> >>
> >>
> >> $ scontrol checkpoint create 137
> >>
> >>
> >> .... time passes ....
> >>
> >>
> >> $ scontrol vacate 137
> >>
> >>
> >> At that point, I see the checkpoint file from BLCR in the current
> >> directory and the checkpoint file from Slurm in
> >> /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
> >> restart the job:
> >>
> >>
> >> $ scontrol checkpoint restart 137
> >> scontrol_checkpoint error: Node count specification invalid
> >>
> >>
> >> In slurmctld's log (at level 7) I see:
> >>
> >>
> >> [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
> >> [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
> >> [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
> >> [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
> >>
> >>
> >> Any insights would be appreciated.
> >>
> >>
> >> Thanks
> >>
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >
> >
>
>


-- 
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
