On 11 January 2012 21:28, Reuti <[email protected]> wrote:
>
> In the rare cases when you really want to extend the runtime of a job, you
> could kill the execd on the node (e.g. by "softstop" as argument to the
> startscript "sgeexecd" in /etc). Then the job won't be aborted. But also no
> new jobs will be sent to the node, as it appears to be unavailable to the
> qmaster. You have to check by hand on the node whether the job finished in
> the meantime and restart the execd. There is also only an email when the
> execd restarts.
>
> -- Reuti

Just found this thread (unread by me at the time) when desperately searching
the interweb thing.

I am in the situation where someone has submitted a long-running job without
bothering to do any checkpointing. The job has even survived a machine room
power outage in which we thought that particular node was going to lose power.
Now, as they get towards the end of the 2880 hours they asked for, they find
they'd like "another 1000 hours" - now how about that!?

The above "kludge" would seem to offer the affected party a way out of their
lack of foresight; however, I am not keen to see the node the job is running
on taken out of operation.
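
For reference, the procedure Reuti describes might look something like the
sketch below. It is only an illustration under assumptions: the script path
/etc/init.d/sgeexecd, the variable SGE_EXECD_SCRIPT, both function names and
the 600-second poll interval are made up here and are not taken from the
thread. It also says nothing about how the restarted execd treats a job that
has run past its limit.

```shell
#!/bin/sh
# Sketch only. The path below is an assumption; point SGE_EXECD_SCRIPT at
# wherever the sgeexecd startup script lives on your node.
SGE_EXECD_SCRIPT="${SGE_EXECD_SCRIPT:-/etc/init.d/sgeexecd}"

# Succeeds (exit status 0) while a process with the given PID exists.
# Run as root: for an unprivileged user, kill -0 also fails on processes
# owned by somebody else.
job_still_running() {
    kill -0 "$1" 2>/dev/null
}

# Poll until the job's process has gone, then start the execd again.
# Arguments: PID of the job's main process, optional poll interval (seconds).
restart_execd_when_done() {
    pid="$1"
    interval="${2:-600}"
    while job_still_running "$pid"; do
        sleep "$interval"
    done
    "$SGE_EXECD_SCRIPT" start
}

# Hypothetical usage on the node, as root (12345 is a placeholder PID):
#   /etc/init.d/sgeexecd softstop
#   restart_execd_when_done 12345
```

This only automates the "check by hand" step; the node still appears
unavailable to the qmaster until the execd is started again.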

I am hoping that Reuti, or someone else, can enlarge upon the suggestion
above. What happens if you start up the execd again?

Presumably it is going to notice the running job and see that it has run
over time?

What communication between the node and the master is still left hanging around
if you "softstop" the execd?

Is it possible to alter the local execd configuration so that a new instance
could be started and have the node then accept other tasks, whilst still
retaining the original communication ports?

If not, then I'm sure I can get other jobs to run on the node in question,
but it'll take a bit of ferkling.

Kevin
ECS, VUW, NZ
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
