Hi,

On 14.06.2012 at 00:20, Kevin Buckley wrote:

> On 11 January 2012 21:28, Reuti <[email protected]> wrote:
>> 
>> In the rare cases when you really want to extend the runtime of a job, you
>> could kill the execd on the node (e.g. by "softstop" as argument to the
>> startscript "sgeexecd" in /etc). Then the job won't be aborted. But also no
>> new jobs will be sent to the node, as it appears to be unavailable to the
>> qmaster. You have to check by hand on the node whether the job finished in
>> the meantime and restart the execd. There is also only an email when the
>> execd restarts.
>> 
>> -- Reuti
> 
> Just found this thread (unread by me at the time) when desperately searching
> the interweb thing.
> 
> I am in the situation where someone has submitted a long-running job without
> bothering to do any checkpointing and now finds, as they get towards the end
> of the 2880 hours they asked for, including surviving a machine room power
> outage where we thought that particular node was going to lose power, they'd
> like "another 1000 hours" - now how about that!?
> 
> The above "kludge" would seem to offer the affected party a way out of their
> lack of foresight; however, I am not keen to see the node the job is running
> on taken out of operation.
> 
> I am hoping that Reuti, or someone else, can enlarge upon the suggestion
> above by asking what happens if you start up the execd again?
> 
> Presumably it is going to notice the running job and see that it has
> run over time?

Exactly, it will kill the job.


> What communication between the node and the master is still left hanging
> around if you "softstop" the execd?

None any longer. The sgeexecd is gone; you could also send a SIGKILL to the
sgeexecd directly.
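
A minimal sketch of the two ways to stop the execd mentioned in this thread. The script path /etc/init.d/sgeexecd is an assumption about the local install (the thread only says the start script is in /etc), and the PID is whatever you look up on the node. The `run` helper is hypothetical: by default it only prints the command, and it executes it only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch: stop the execd without aborting the running job.
# Assumption (not from the thread): the start script path below.
SGEEXECD_SCRIPT="${SGEEXECD_SCRIPT:-/etc/init.d/sgeexecd}"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Option 1: ask the start script to stop only the execd.
stop_execd_soft() {
    run "$SGEEXECD_SCRIPT" softstop
}

# Option 2: send SIGKILL to the execd directly, given its PID.
stop_execd_kill() {
    run kill -KILL "$1"
}
```

For example, `stop_execd_soft` prints the softstop command for review, and `RUN=1 stop_execd_kill 4242` would actually send the signal (4242 is a placeholder PID).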


> Is it possible to alter the local execd configuration so that a new
> instance could be started and have the node then accept other tasks,
> whilst still retaining the original communication ports?

Never tried it (so, no guarantee): before starting the execd again, you may
need to change the location of the execd's spool directory (i.e. in the
local host configuration: `qconf -mconf node17` or similar, then set
"execd_spool_dir"). Then it won't see the former jobs. After the long-running
job has ended, it needs to be removed from the qmaster's list of jobs with
`qdel -f 1234`.
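
The steps above could be sketched as follows. This is as untested against a real cluster as the suggestion itself. node17 and 1234 are the placeholder values used above, the script path /etc/init.d/sgeexecd is an assumption about the local install, and `run` and `job_still_running` are hypothetical helpers. Commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the (untested) procedure: new spool dir, restart execd,
# later remove the old job from the qmaster.
NODE="${NODE:-node17}"          # placeholder host name from the thread
JOB_ID="${JOB_ID:-1234}"        # placeholder job id from the thread
# Assumption (not from the thread): the start script path below.
SGEEXECD_SCRIPT="${SGEEXECD_SCRIPT:-/etc/init.d/sgeexecd}"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Step 1: edit the local host configuration (opens an editor) and
# set execd_spool_dir to a new location there.
edit_host_conf() {
    run qconf -mconf "$NODE"
}

# Step 2: start the execd again so the node can accept new jobs.
start_execd() {
    run "$SGEEXECD_SCRIPT" start
}

# Check by hand on the node: succeeds while the process with the given
# PID still exists. kill -0 sends no signal; it needs permission to
# signal the process, so run it as root or as the job owner.
job_still_running() {
    kill -0 "$1" 2>/dev/null
}

# Step 3: once the long-running job has ended, remove it from the
# qmaster's list of jobs.
remove_job() {
    run qdel -f "$JOB_ID"
}
```

A cautious way to use this is to run each function once without RUN=1, read the printed command, and only then repeat it with RUN=1.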

-- Reuti


> If not, then I'm sure I can get other jobs to run on the node in
> question, but it'll take a bit of ferkling
> 
> Kevin
> ECS, VUW, NZ


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
