On 06/13/2012 03:20 PM, Kevin Buckley wrote:
On 11 January 2012 21:28, Reuti <[email protected]> wrote:

In the rare cases when you really want to extend the runtime of a job, you could
kill the execd on the node (e.g. by passing "softstop" as the argument to the
startscript "sgeexecd" in /etc). Then the job won't be aborted. But no new jobs
will be sent to the node either, as it appears unavailable to the qmaster. You
have to check by hand on the node whether the job finished in the meantime
and restart the execd. Also, an email is only sent when the execd restarts.

-- Reuti

Just found this thread (unread by me at the time) when desperately searching
the interweb thing.

I am in the situation where someone has submitted a long-running job without
bothering to do any checkpointing and now finds, as they get towards the end
of the 2880 hours they asked for, including surviving a machine room power
outage where we thought that particular node was going to lose power, they'd
like "another 1000 hours" - now how about that!?

The above "kludge" would seem to offer the affected party a way out of their
lack of foresight; however, I am not keen to see the node the job is running
on taken out of operation.

I am hoping that Reuti, or someone else, can enlarge upon the suggestion
above: what happens if you start up the execd again?

Presumably it is going to notice the running job and see that it has
run over time?

Yes.


What communication between the node and the master is still left hanging around
if you "softstop" the execd?


None. The softstop tells execd to exit without touching any of the currently running jobs. You can take a test node, run some jobs on it, try 'stop' vs 'softstop' and see for yourself.
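
The "check by hand" step Reuti describes (wait for the job to finish, then restart the execd) could be scripted roughly like this. It is only a sketch under assumptions: the PID of the job's process on the node has been looked up by hand beforehand, and the command that starts the execd is passed in by the caller, since its location differs between installations. The names pid_alive and wait_then_restart are made up for illustration and are not part of Grid Engine.

```python
import os
import subprocess
import time


def pid_alive(pid):
    """Return True if a process with this PID currently exists on this node."""
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def wait_then_restart(job_pid, start_cmd, poll_seconds=300,
                      alive=pid_alive, sleep=time.sleep, run=subprocess.run):
    """Poll until job_pid is gone, then run start_cmd once to bring execd back.

    alive, sleep and run are parameters only so that the loop can be tried
    out without touching a real node.
    """
    while alive(job_pid):
        sleep(poll_seconds)
    # check=True raises CalledProcessError if the start script exits non-zero.
    return run(start_cmd, check=True)
```

A call might look like wait_then_restart(12345, ["/etc/init.d/sgeexecd", "start"]), where both the PID and the script path are placeholders for whatever applies on the node in question. Over a wait of many weeks a PID could in principle be reused by another process, so a final look by hand is still advisable.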

Is it possible to alter the local execd configuration so that a new
instance could be started and have the node then accept other tasks,
whilst still retaining the original communication ports?


I'm not sure, but I think not.

If not, then I'm sure I can get other jobs to run on the node in
question, but it'll take a bit of ferkling.

Kevin
ECS, VUW, NZ
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

--
Alex Chekholko [email protected] 347-401-4860