Thank you for the patch. This will be in versions 2.5.7 and 2.6.0-rc1  
(release candidate 1), both to be released this week.

Quoting Hongjia Cao <[email protected]>:

> This is a problem with the checkpointing code in SLURM. The attached
> patch fixes it.
>
> On Wed, 2013-05-29 at 12:50 -0700, Michael Gutteridge wrote:
>> We're having some trouble getting our slurm jobs to successfully
>> restart after a checkpoint.  For this test, I'm using sbatch and a
>> simple, single-threaded executable.  Slurm is 2.5.4, blcr is 0.8.5.
>>
>> I'm submitting the job using sbatch:
>>
>>
>> $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
>>
>>
>>
>> I am able to create the checkpoint and vacate the node:
>>
>>
>> $ scontrol checkpoint create 137
>>
>>
>> .... time passes ....
>>
>>
>> $ scontrol vacate 137
>>
>>
>> At that point, I see the checkpoint file from blcr in the current
>> directory and the checkpoint file from Slurm
>> in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
>> restart the job:
>>
>>
>> $ scontrol checkpoint restart 137
>> scontrol_checkpoint error: Node count specification invalid
>>
>>
>> In slurmctld's log (at level 7) I see:
>>
>>
>> [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
>> [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
>> [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
>> [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
>>
>>
>> Any insights would be appreciated.
>>
>>
>> Thanks
>>
>>
>> Michael
