Thank you for the patch. This will be in versions 2.5.7 and 2.6.0-rc1 (release candidate 1), both to be released this week.
Quoting Hongjia Cao <[email protected]>:
> This is a problem with the checkpointing code in SLURM. The attached
> patch fixes it.
>
> On Wed, 2013-05-29 at 12:50 -0700, Michael Gutteridge wrote:
>> We're having some trouble getting our slurm jobs to successfully
>> restart after a checkpoint. For this test, I'm using sbatch and a
>> simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5.
>>
>> I'm submitting the job using sbatch:
>>
>> $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
>>
>> I am able to create the checkpoint and vacate the node:
>>
>> $ scontrol checkpoint create 137
>>
>> .... time passes ....
>>
>> $ scontrol vacate 137
>>
>> At that point, I see the checkpoint file from blcr in the current
>> directory and the checkpoint file from Slurm
>> in /var/spool/slurm-llnl/checkpoint. However, when I attempt to
>> restart the job:
>>
>> $ scontrol checkpoint restart 137
>> scontrol_checkpoint error: Node count specification invalid
>>
>> In slurmctld's log (at level 7) I see:
>>
>> [2013-05-29T12:41:08-07:00] debug2: Processing RPC:
>> REQUEST_CHECKPOINT(restart) from uid=*****
>> [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header
>> is JOB_CKPT_002
>> [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
>> [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node
>> count specification invalid
>>
>> Any insights would be appreciated.
>>
>> Thanks
>>
>> Michael
