Hi,

On 14.06.2012 at 00:20, Kevin Buckley wrote:
> On 11 January 2012 21:28, Reuti <[email protected]> wrote:
>>
>> In the rare cases when you really want to extend the runtime of a job, you
>> could kill the execd on the node (e.g. by "softstop" as argument to the
>> startscript "sgeexecd" in /etc). Then the job won't be aborted. But also no
>> new jobs will be sent to the node as it appears as being unavailable to the
>> qmaster. You have to check by hand on the node, whether the job finished in
>> the meantime and restart the execd. There is also only an email, when the
>> execd restarts.
>>
>> -- Reuti
>
> Just found this thread (unread by me at the time) when desperately searching
> the interweb thing.
>
> I am in the situation where someone has submitted a long-running job without
> bothering to do any checkpointing and now finds, as they get towards the end
> of the 2880 hours they asked for, including surviving a machine room power
> outage where we thought that particular node was going to lose power, they'd
> like "another 1000 hours" - now how about that!?
>
> The above "kludge" would seem to offer the affected party a way out of their
> lack of foresight; however, I am not keen to see the node the job is running
> on taken out of operation.
>
> I am hoping that Reuti, or someone else, can enlarge upon the suggestion
> above by asking what happens if you start up the execd again?
>
> Presumably it is going to notice the running job and see that it has
> run over time?

Exactly, it will kill the job.

> What communication between the node and the master is still left hanging
> around if you "softstop" the execd?

None any longer. The sgeexecd is gone; you could also send a SIGKILL to the sgeexecd directly.

> Is it possible to alter the local execd configuration so that a new
> instance could be started and have the node then accept other tasks,
> whilst still retaining the original communication ports?

Never tried it (so, no guarantee): before starting the execd again, you may need to change the location of the execd's spool directory, i.e. edit the local host configuration with `qconf -mconf node17` or the like and set "execd_spool_dir" there. Then the new execd won't see the former jobs. After the long-running job has ended, it needs to be removed from the qmaster's list of jobs with `qdel -f 1234`.

-- Reuti

> If not, then I'm sure I can get other jobs to run on the node in
> question but it'll take a bit of ferkling
>
> Kevin
> ECS, VUW, NZ
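
PS: The sequence described above could be sketched roughly as the following dry-run script. It is just as untested as the suggestion itself. `node17` and `1234` are the example host and job id from this mail. The init script path `/etc/init.d/sgeexecd`, the new spool location in `NEW_SPOOL` and the `run` helper are assumptions for illustration only, so adjust them to your installation. With the default `DRY_RUN=1` it only prints the commands.

```shell
#!/bin/sh
# Sketch only, untested: prints the commands unless DRY_RUN=0.
NODE="${NODE:-node17}"                       # example host from this mail
JOB_ID="${JOB_ID:-1234}"                     # example job id from this mail
NEW_SPOOL="${NEW_SPOOL:-/var/spool/sge_alt}" # hypothetical new spool directory
SGEEXECD="${SGEEXECD:-/etc/init.d/sgeexecd}" # assumed startscript location
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Show the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

extend_job_steps() {
    # On the node: stop the execd but leave the running job alone.
    run "$SGEEXECD" softstop
    # On an admin host: open the local configuration of the node in an editor.
    echo "# in the editor, set: execd_spool_dir $NEW_SPOOL"
    run qconf -mconf "$NODE"
    # On the node: start a fresh execd, which uses the new (empty) spool dir.
    run "$SGEEXECD" start
}

cleanup_steps() {
    # Once the long-running job has ended (check by hand on the node),
    # remove it from the qmaster's list of jobs.
    run qdel -f "$JOB_ID"
}
```

Calling `extend_job_steps` now and `cleanup_steps` after the job has finished prints the commands in order. Whether the old job's shepherd stays undisturbed by the second execd is exactly the part I have not tried.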
