When transparent checkpointing is enabled, is there any way to have the checkpointing signal sent when a job is suspended or rescheduled?
I've worked through the examples for transparent checkpointing and transparent checkpointing with Condor at http://gridscheduler.sourceforge.net/howto/checkpointing.html using version 6.2u5p1.

I can get checkpointing to work when I set "Checkpoint When" to "On Min CPU Interval". When I set "Checkpoint When" to "On Job Suspend" or "Reschedule Job" and then suspend or reschedule a checkpointable job, no checkpoint file is created, and as far as I can tell the checkpoint signal is not being sent.

The checkpointing Howto at the above URL says:

> Here the "when" condition for the creation of a checkpoint file is set to
> "xmr". So a checkpoint will be created, by sending the signal usr2 (to the
> job, i.e. the whole processgroup), when the specified time interval (which
> is defined in the queue definition) has elapsed, or when the node goes
> offline.. It's not possible to initiate a checkpoint just before the
> migration of the job, but we set the "x" anyway to get the job at least
> restarted. This is limited to happen in the migration script, which is
> available for application-level checkpointing.

If checkpointing isn't allowed on suspend or reschedule, why are those options listed?

Thanks,
Lane
