Thanks, all. This patch fixed the errors we were having.

On Mon, Jun 3, 2013 at 8:36 AM, Moe Jette <[email protected]> wrote:
>
> Thank you for the patch. This will be in versions 2.5.7 and 2.6.0-rc1
> (release candidate 1), both to be released this week.
>
> Quoting Hongjia Cao <[email protected]>:
>
> > This is a problem with the checkpointing code in SLURM. The attached
> > patch fixes it.
> >
> > On Wed, 2013-05-29 at 12:50 -0700, Michael Gutteridge wrote:
> >> We're having some trouble getting our slurm jobs to successfully
> >> restart after a checkpoint. For this test, I'm using sbatch and a
> >> simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5.
> >>
> >> I'm submitting the job using sbatch:
> >>
> >> $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
> >>
> >> I am able to create the checkpoint and vacate the node:
> >>
> >> $ scontrol checkpoint create 137
> >>
> >> .... time passes ....
> >>
> >> $ scontrol vacate 137
> >>
> >> At that point, I see the checkpoint file from blcr in the current
> >> directory and the checkpoint file from Slurm
> >> in /var/spool/slurm-llnl/checkpoint. However, when I attempt to
> >> restart the job:
> >>
> >> $ scontrol checkpoint restart 137
> >> scontrol_checkpoint error: Node count specification invalid
> >>
> >> In slurmctld's log (at level 7) I see:
> >>
> >> [2013-05-29T12:41:08-07:00] debug2: Processing RPC:
> >> REQUEST_CHECKPOINT(restart) from uid=*****
> >> [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header
> >> is JOB_CKPT_002
> >> [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
> >> [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node
> >> count specification invalid
> >>
> >> Any insights would be appreciated.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >
>

--
Hey! Somebody punched the foley guy! - Crow, MST3K ep. 508
