I have found that in order to support SUSPEND preemption we cannot use 
CR_Memory, i.e. memory as a consumable resource.  I've seen that if a job in a 
preemptable partition has requested 15900MB of RAM on a 16GB node then the job 
will not be preempted, and understandably so, presumably because a suspended 
job keeps its memory allocated.  Now I'm looking at how to implement 
preemption using checkpointing.  However, I'm unable to find any documentation 
on the exact behavior, configuration, and necessary packages.

I have rebuilt the BLCR SRPM for my cluster, but am unsure which of the 
resulting packages are needed on which systems.  I have the SLURM controller, 
SLURM compute nodes, and SLURM submit hosts (login nodes) that do not run the 
slurm daemon but only submit jobs.

I'm also unsure what the expected behavior is when a job is preempted and 
checkpointed.  Will the job's state be saved?  The documentation mentions 
ImageDir but does not say how it is set other than through interactive 
scontrol commands.  If I enable PreemptMode=CHECKPOINT, I'm just not clear on 
what the expected behavior will be for a user's job.
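
To make the question concrete, the configuration in question would look 
roughly like the fragment below.  This is an untested sketch, not a known-good 
setup: the partition names, node names, and checkpoint directory are 
hypothetical, and the parameter names are the ones I understand from the 
slurm.conf man page, so they should be checked against the installed version.

```
# slurm.conf fragment -- untested sketch, names and paths are hypothetical
# Preempt jobs based on partition priority
PreemptType=preempt/partition_prio
# Checkpoint preempted jobs instead of suspending them
PreemptMode=CHECKPOINT
# Use the BLCR checkpoint plugin
CheckpointType=checkpoint/blcr
# Assumed default location for checkpoint images (hypothetical path)
JobCheckpointDir=/var/slurm/checkpoint

# Jobs in "high" may preempt jobs in "low" on the same nodes
PartitionName=high Nodes=node[01-04] Priority=10
PartitionName=low  Nodes=node[01-04] Priority=1
```

What is unclear is whether a job preempted under such a configuration is 
checkpointed to JobCheckpointDir and later restarted automatically, or whether 
the user has to take additional steps.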

Any guidance on how other sites have implemented BLCR checkpointing, along 
with your experiences, would be useful.

Thanks,
- Trey

=============================

Trey Dockendorf 
Systems Analyst I 
Texas A&M University 
Academy for Advanced Telecommunications and Learning Technologies 
Phone: (979)458-2396 
Email: [email protected] 
Jabber: [email protected]
