We are currently exploring BLCR checkpointing of SLURM jobs in a test 
environment with the hope that this will be something we can put in our 
production environment for our users. Depending on how robust BLCR 
checkpointing is, this could be a huge win for us, as it would make downtimes 
(planned or unplanned!) far less painful for our users.

I’m curious, what other sites have experience with BLCR checkpointing? We have 
users from a variety of research domains who are not always the most 
sophisticated. The majority of our jobs are single-core Matlab, Python, or R, 
with a mix of parallel multithreaded, MPI, and GPU jobs sprinkled in. We’ll be 
testing these applications over the next several weeks, but as we’re getting 
started I would be very interested in hearing about others’ experiences with 
BLCR checkpointing (good or bad) in SLURM.

We’re currently running SLURM 15.08.7 in production.

Many thanks!

Will French 
