See the MinJobAge configuration option:
http://slurm.schedmd.com/slurm.conf.html
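For example, shortening the purge window is a one-line change in slurm.conf. MinJobAge defaults to 300 seconds, which matches the roughly 5-minute delay reported below; the value 10 here is only illustrative:

```
# slurm.conf
# Minimum age, in seconds, that a completed job's record must reach
# before slurmctld may purge it. Default: 300.
MinJobAge=10
```

After editing slurm.conf, "scontrol reconfigure" should make slurmctld pick up the new value.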
Quoting Manuel Rodríguez Pascual <[email protected]>:
Hi all,
I have been running some more tests to understand the Slurm internals and
to reduce the checkpoint/restart time.
Looking at the job status with slurm_print_job_info, I have observed that
the job remains in the "RUNNING" state for about 5 minutes after a
"slurm_checkpoint_vacate":
JobId=2133 JobName=variableSizeTester.sh
UserId=slurm(500) GroupId=slurm(1000)
Priority=4294901754 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:05:39 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2015-06-02T06:43:16 EligibleTime=2015-06-02T06:43:16
StartTime=2015-06-02T06:43:17 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=debug AllocNode:Sid=slurm-master:2951
(...)
So when calling "slurm_checkpoint_restart", slurmctld complains with
attempt re-use active job_id 2133
slurm_rpc_checkpoint restart 2133: Duplicate job id
and the same error is returned until the aforementioned 5-minute limit
expires, at which point the job record is released and cleaned up:
slurmctld: debug2: Purging old records
slurmctld: debug2: purge_old_job: purged 1 old job records
and the checkpoint can then be restarted.
I have tried calling purge_old_job() to reduce this time, but it does not
work, so I assume the problem is that the job really is considered to be
running, not that slurmctld holds stale information. Also, there is no query
from slurmctld to the compute node, so this seems to be some kind of
internal timeout. Am I right?
My question, then, is: can this time be reduced somehow? Is there any
particular reason why the job is considered active by slurmctld for about
5 minutes after its checkpoint and cancellation?
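To illustrate, the behaviour I am seeing amounts to the following Python sketch. query_job_state() is a made-up stand-in for slurm_load_job/slurm_print_job_info; I am only simulating the apparent purge window here, not calling Slurm:

```python
# Simulate slurmctld holding a vacated job's record as "RUNNING" until
# some internal age limit is reached, after which the record is purged.
def make_job_query(vacate_time, purge_after=300):
    def query_job_state(now):
        if now - vacate_time < purge_after:
            return "RUNNING"   # record still held; restart gets "Duplicate job id"
        return None            # record purged; the job id can be reused
    return query_job_state

query = make_job_query(vacate_time=0)
assert query(10) == "RUNNING"   # right after slurm_checkpoint_vacate
assert query(299) == "RUNNING"  # still blocked just under 5 minutes later
assert query(300) is None       # record purged; restart now succeeds
```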
Thanks for your attention.
Best regards,
Manuel
2015-05-29 18:00 GMT+02:00 Manuel Rodríguez Pascual <
[email protected]>:
Hi all,
I have been messing around a little bit with task checkpoint/restart.
I am employing BLCR to checkpoint a fairly small application with
slurm_checkpoint_vacate, which should take only a few seconds. However,
when I try to restart it with slurm_checkpoint_restart, the process is very
slow. Looking at the output of slurmctld, what I get is
----
slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
slurmctld: attempt re-use active job_id 2110
slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
----
If I keep issuing the same call, the output is identical for some time,
until Slurm cleans up its internal structures (or something like that),
writing to the log
----
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: purge_old_job: purged 1 old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
----
Then the next call to slurm_checkpoint_restart succeeds, with
----
slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
slurmctld: debug2: found 9 usable nodes from config containing
slurm-compute[1-9]
slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: backfill: beginning
slurmctld: debug2: backfill: entering _try_sched for job 2110.
slurmctld: debug2: found 2 usable nodes from config containing
slurm-compute[1-9]
slurmctld: backfill: Started JobId=2110 on slurm-compute2
----
I am wondering why all this is necessary. Why can't the "vacate" call
delete everything related to the job, so it can be restarted immediately?
If there is a particular reason that makes that impossible, why can't the
Slurm structures be cleaned (purged or whatever) every 10 seconds or so,
instead of once every 5-10 minutes? Would that cause significant overhead
or a scalability issue? Or, as an alternative, is there any API call that
can be employed to trigger that purge?
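In the meantime I am simply retrying the call. A minimal Python sketch of that pattern, where fake_restart() is a made-up stand-in for slurm_checkpoint_restart, simulated to fail with "Duplicate job id" until the record is purged (a real loop would sleep between attempts):

```python
def retry_restart(try_restart, max_attempts=10):
    """Call try_restart() until it no longer reports a duplicate job id,
    mimicking repeated slurm_checkpoint_restart calls."""
    for attempt in range(1, max_attempts + 1):
        ok, msg = try_restart()
        if ok:
            return attempt, msg
        # "Duplicate job id": the old record is not purged yet; in a real
        # loop we would time.sleep() here before trying again.
    return None, "gave up"

# Simulated slurmctld: the first 3 attempts hit the still-active record.
calls = {"n": 0}
def fake_restart():
    calls["n"] += 1
    if calls["n"] <= 3:
        return False, "Duplicate job id"
    return True, "restart scheduled"

print(retry_restart(fake_restart))  # → (4, 'restart scheduled')
```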
Thanks for your help,
Manuel
--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support