On 11.03.2011 at 15:02, Lane Schwartz wrote:

> When transparent checkpointing is enabled, is there any way to have the 
> checkpointing signal sent when a job is suspended or rescheduled?
>  
> I've worked through the examples for transparent checkpointing and 
> transparent checkpointing with Condor at 
> http://gridscheduler.sourceforge.net/howto/checkpointing.html using the 
> version 6.2u5p1. I can get checkpointing to work when I set "Checkpoint When" 
> to "On Min CPU Interval".
>  
> When I try setting "Checkpoint When" to "On Job Suspend" or "Reschedule Job" 
> and then try to suspend or reschedule a checkpointable job, no checkpoint 
> file is created, and as far as I can tell the checkpoint signal is not being 
> sent.

Correct.


> The checkpointing Howto at the above URL says:
>  
> Here the "when" condition for the creation of a checkpoint file is set to 
> "xmr". So a checkpoint will be created, by sending the signal usr2 (to the 
> job, i.e. the whole processgroup), when the specified time interval (which is 
> defined in the queue definition) has elapsed, or when the node goes offline. 
> It's not possible to initiate a checkpoint just before the migration of the 
> job, but we set the "x" anyway to get the job at least restarted. This is 
> limited to happen in the migration script, which is available for 
> application-level checkpointing.
>  
>  If checkpointing isn't allowed on suspend or reschedule, why are those 
> options listed?

It's not that it's not allowed. For "x", creating a checkpoint is simply not 
implemented, and for "r" it's not possible, because by then the node has 
become unknown (possibly crashed).


"m" = create a checkpoint "when": the specified time interval has elapsed
"r" = reschedule "when": node goes offline
"x" = reschedule "when": job gets suspended

You could say that the "when" has different meanings.

"r" means that the job is rescheduled when the exechost goes into an unknown 
state, not when you reschedule it by hand.
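
To make the distinction concrete, here is a minimal Python sketch of how a 
"when" string such as "xmr" breaks down under the reading given above. It is 
only an illustration: the table and the helper names (`WHEN_ACTIONS`, 
`explain_when`, `letters_that_checkpoint`) are made up for this mail and are 
not part of Grid Engine, and only the three letters discussed here are covered.

```python
# Meaning of each "when" letter as described in this thread.
# Hypothetical table for illustration only, not a Grid Engine API.
WHEN_ACTIONS = {
    "m": ("checkpoint", "the specified time interval has elapsed"),
    "r": ("reschedule", "the node goes offline (exechost in unknown state)"),
    "x": ("reschedule", "the job gets suspended"),
}


def explain_when(when):
    """Return a list of (letter, action, trigger) for a "when" string."""
    result = []
    for letter in when:
        if letter not in WHEN_ACTIONS:
            # Letters not discussed in this thread are rejected on purpose.
            raise ValueError(f"letter not covered here: {letter!r}")
        action, trigger = WHEN_ACTIONS[letter]
        result.append((letter, action, trigger))
    return result


def letters_that_checkpoint(when):
    """Return only the letters that actually cause a checkpoint to be written."""
    return [letter for letter, action, _ in explain_when(when)
            if action == "checkpoint"]


if __name__ == "__main__":
    for letter, action, trigger in explain_when("xmr"):
        print(f'"{letter}": {action} when {trigger}')
    print("checkpoint is written for:", letters_that_checkpoint("xmr"))
```

With "xmr", only "m" ends up in the list returned by 
`letters_that_checkpoint`, which matches the observation that no checkpoint 
file appears on suspend.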

==

The original documentation is wrong (`man checkpoint`).

So "m" defines "when" to create a checkpoint, and the other three ("s", "x" 
and "r") define "when" the job should be "rescheduled" (no checkpoint is 
created in these cases). Hence we need "x" so that suspending a job can 
trigger its rescheduling.

The state diagrams in http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf may 
illustrate this to some extent.

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
