When transparent checkpointing is enabled, is there any way to have the
checkpointing signal sent when a job is suspended or rescheduled?

I've worked through the examples for transparent checkpointing and
transparent checkpointing with Condor at
http://gridscheduler.sourceforge.net/howto/checkpointing.html using
version 6.2u5p1. I can get checkpointing to work when I set "Checkpoint
When" to "On Min CPU Interval".

When I try setting "Checkpoint When" to "On Job Suspend" or "Reschedule Job"
and then try to suspend or reschedule a checkpointable job, no checkpoint
file is created, and as far as I can tell the checkpoint signal is not being
sent. The checkpointing Howto at the above URL says:


> Here the "when" condition for the creation of a checkpoint file is set to
> "xmr". So a checkpoint will be created, by sending the signal usr2 (to the
> job, i.e. the whole processgroup), when the specified time interval (which
> is defined in the queue definition) has elapsed, or when the node goes
> offline. It's not possible to initiate a checkpoint just before the
> migration of the job, but we set the "x" anyway to get the job at least
> restarted. This is limited to happen in the migration script, which is
> available for application-level checkpointing.

If checkpointing isn't allowed on suspend or reschedule, why are those
options listed?
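
One way to confirm whether the signal reaches the job at all is to submit a
small test job that traps it and logs each arrival. Below is a minimal sketch
in Python, assuming the checkpointing environment sends usr2 as in the Howto.
The log path, the CKPT_SIGNAL_LOG variable, and the function names are
hypothetical, chosen only for illustration.

```python
import os
import signal
import time

# Hypothetical log location; point it at a directory the job can write to.
LOG_PATH = os.environ.get("CKPT_SIGNAL_LOG", "ckpt_signal.log")


def on_checkpoint_signal(signum, frame):
    """Append a timestamped line each time the trapped signal arrives."""
    with open(LOG_PATH, "a") as log:
        log.write(f"{time.time():.0f} received signal {signum}\n")


def install_handler(sig=signal.SIGUSR2):
    """Trap the signal named in the checkpointing environment (assumed usr2)."""
    signal.signal(sig, on_checkpoint_signal)


def run(seconds):
    """Stand-in workload: idle long enough to be suspended or rescheduled."""
    deadline = time.time() + seconds
    while time.time() < deadline:
        time.sleep(1)
```

In the job script, call install_handler() and then run(600), submit the job
with the checkpointing environment attached, and suspend or reschedule it. If
the log file stays empty, the signal was probably not delivered to this process.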

Thanks,
Lane
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
