We are currently exploring BLCR checkpointing of SLURM jobs in a test environment, with the hope that this will be something we can put in our production environment for our users. Depending on how robust BLCR checkpointing is, this could be a huge win for us, as it would make downtimes (planned or unplanned!) far less painful for our users.
I’m curious: what other sites have experience with BLCR checkpointing? We have users from a variety of research domains who are not always the most sophisticated. The majority of our jobs are single-core MATLAB, Python, or R, with a mix of parallel multithreaded, MPI, and GPU jobs sprinkled in. We’ll be testing these applications over the next several weeks, but as we’re getting started I would be very interested in hearing about others’ experiences with BLCR checkpointing (good or bad) in SLURM. We’re currently running SLURM 15.08.7 in production.

Many thanks!
Will French