On 11.03.2011 at 15:02, Lane Schwartz wrote:

> When transparent checkpointing is enabled, is there any way to have the
> checkpointing signal sent when a job is suspended or rescheduled?
>
> I've worked through the examples for transparent checkpointing and
> transparent checkpointing with Condor at
> http://gridscheduler.sourceforge.net/howto/checkpointing.html using
> version 6.2u5p1. I can get checkpointing to work when I set "Checkpoint When"
> to "On Min CPU Interval".
>
> When I try setting "Checkpoint When" to "On Job Suspend" or "Reschedule Job"
> and then try to suspend or reschedule a checkpointable job, no checkpoint
> file is created, and as far as I can tell the checkpoint signal is not being
> sent.

Correct.

> The checkpointing Howto at the above URL says:
>
> Here the "when" condition for the creation of a checkpoint file is set to
> "xmr". So a checkpoint will be created, by sending the signal usr2 (to the
> job, i.e. the whole processgroup), when the specified time interval (which is
> defined in the queue definition) has elapsed, or when the node goes offline.
> It's not possible to initiate a checkpoint just before the migration of the
> job, but we set the "x" anyway to get the job at least restarted. This is
> limited to happen in the migration script, which is available for
> application-level checkpointing.
>
> If checkpointing isn't allowed on suspend or reschedule, why are those
> options listed?

It's not that it's not allowed. For "x" it's not implemented, and for "r" it's not possible, as the node became unknown (possibly crashed).

"m" = create a checkpoint "when": the specified time interval has elapsed
"r" = reschedule "when": node goes offline
"x" = reschedule "when": job gets suspended

The "when" has different meanings, you could say. "r" means to reschedule the job when the exechost gets into the unknown state, not that you reschedule it by hand.

==

The original documentation is wrong (`man checkpoint`). So "m" defines "when" to create a checkpoint, and the other three define "when" the job should be "rescheduled" (no checkpoint is created here). Hence we need "x" so that suspending a job can trigger its rescheduling.

The state diagrams in http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf may show it a little bit.

-- Reuti
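
The distinction between the "when" flags can be restated as a small lookup, in case that makes it easier to see why "xmr" only ever produces a checkpoint through "m". This is only a sketch in Python: `WHEN_FLAGS`, `describe_when` and `creates_checkpoint` are hypothetical names and not part of Grid Engine, and only the three flags spelled out in this message are covered.

```python
# Hypothetical helper, not part of Grid Engine: it only restates the
# flag meanings given in this thread ("m", "r" and "x").
WHEN_FLAGS = {
    "m": ("checkpoint", "the specified time interval has elapsed"),
    "r": ("reschedule", "the node goes offline (exechost in unknown state)"),
    "x": ("reschedule", "the job gets suspended"),
}


def describe_when(when):
    """Return (flag, action, trigger) tuples for each flag in a "when" string."""
    result = []
    for flag in when:
        if flag not in WHEN_FLAGS:
            # Flags not explained in the thread are rejected, not guessed.
            raise ValueError(f"flag {flag!r} is not described in this thread")
        action, trigger = WHEN_FLAGS[flag]
        result.append((flag, action, trigger))
    return result


def creates_checkpoint(when):
    """True if any flag in the "when" string actually writes a checkpoint."""
    return any(action == "checkpoint" for _, action, _ in describe_when(when))


if __name__ == "__main__":
    for flag, action, trigger in describe_when("xmr"):
        print(f"{flag}: {action} when {trigger}")
```

With this reading, a "when" of "xr" would reschedule the job on suspension or on node loss but never write a checkpoint, which matches the behaviour reported in the question.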
