On 11 January 2012 21:28, Reuti <[email protected]> wrote:
>
> In the rare cases when you really want to extend the runtime of a job, you
> could kill the execd on the node (e.g. by "softstop" as argument to the
> startscript "sgeexecd" in /etc). Then the job won't be aborted. But also no
> new jobs will be sent to the node, as it appears as being unavailable to the
> qmaster. You have to check by hand on the node whether the job finished in
> the meantime and restart the execd. There is also only an email when the
> execd restarts.
>
> -- Reuti

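The "check by hand on the node" step in the quoted suggestion could be automated roughly as below. This is only a sketch under stated assumptions: the script path in `EXECD_SCRIPT`, its `start` argument, and the helper names are illustrative and will differ between installations, and the job's process ID has to be looked up on the node first (e.g. with `ps`). It does not address what the execd does about an overrun job once it is started again.

```python
import os
import subprocess
import time

# Assumed location of the execd start script; adjust for the local installation.
EXECD_SCRIPT = "/etc/init.d/sgeexecd"


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists on the node."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_then_restart(job_pid, poll_seconds=300.0, restart_cmd=None):
    """Poll until the job's process has gone, then start the execd again."""
    if restart_cmd is None:
        # Hypothetical invocation; check the arguments your start script accepts.
        restart_cmd = [EXECD_SCRIPT, "start"]
    while pid_alive(job_pid):
        time.sleep(poll_seconds)
    subprocess.run(restart_cmd, check=True)
```

Run as root on the execution node, something like `wait_then_restart(12345)` would sit in the background until process 12345 exits and then bring the execd back, so the node is not left unavailable longer than necessary.
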
Just found this thread (unread by me at the time) when desperately searching the interweb thing.

I am in the situation where someone has submitted a long-running job without bothering to do any checkpointing and now finds, as they get towards the end of the 2880 hours they asked for (having survived a machine room power outage in which we thought that particular node was going to lose power), that they'd like "another 1000 hours" - now how about that!?

The above "kludge" would seem to offer the affected party a way out of their lack of foresight. However, I am not keen to see the node the job is running on taken out of operation.

I am hoping that Reuti, or someone else, can enlarge upon the suggestion above:

What happens if you start up the execd again? Presumably it is going to notice the running job and see that it has run over time?

What communication between the node and the master is still left hanging around if you "softstop" the execd?

Is it possible to alter the local execd configuration so that a new instance could be started and have the node then accept other tasks, whilst still retaining the original communication ports?

If not, then I'm sure I can get other jobs to run on the node in question, but it'll take a bit of ferkling.

Kevin
ECS, VUW, NZ
