Re: [gridengine users] Queue limit s_rt / h_rt and CheckPoint

Reuti Fri, 01 Nov 2013 11:56:56 -0700

On 01.11.2013 at 19:29, Reuti wrote:

> Hi,
> 
> On 01.11.2013 at 19:18, Joseph Farran wrote:
> 
>> Yes, after going through the logs, the subsequent restarts are messed up.
>> 
>> I've played with it more and there is no easy way to do this inside the job 
>> submission script,
> 
> Inside the submission script it's possible - I thought you were looking to 
> get it implemented in SGE (but the user has to take care of it [i.e. you 
> trust the users] - or you use a "starter_method"):
> 
> #!/bin/sh
> . /usr/sge/default/common/settings.sh
> { sleep 172800; qmod -sj $JOB_ID; } &


Interesting, I expected "{ list; }" to save a subshell process, but the 
opposite seems to be true:

"(list) &" will be a direct child of the running bash
"{ list; } &" will create an additional bash instance

Hence, using "(list) &" might be better suited here.
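
As a minimal sketch of how the timer could be wrapped using the "(list) &" 
form: the helper name start_timer and the variable TIMER_PID are made up for 
illustration and are not part of SGE; the delay and the qmod call are the ones 
from the script quoted above.

```shell
#!/bin/sh
# Sketch only: start_timer and TIMER_PID are hypothetical names, not SGE features.

# start_timer DELAY_SECONDS CMD [ARGS...]
# Runs CMD after DELAY_SECONDS in a background subshell, using the
# "(list) &" form. The PID of that subshell is left in TIMER_PID.
start_timer() {
    timer_delay=$1
    shift
    ( sleep "$timer_delay"; "$@" ) &
    TIMER_PID=$!
}

# Intended use inside a job script (assumes the SGE settings file has
# been sourced, so that qmod is on PATH and JOB_ID is set):
#   start_timer 172800 qmod -sj "$JOB_ID"
#   ./my_application
```

Passing the command as arguments keeps the helper independent of SGE, so it 
can be tried outside a job with any harmless command first.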

-- Reuti


> ./my_application
> 
> 
>> so I will have to resort (as you indicated) to using an outside script that 
>> runs periodically and does a "qmod -sj" on the job / job.task-id when it is 
>> near the s_rt value.
>> 
>> It seems to me that Grid Engine is missing an option in the checkpoint 
>> environment to detect when the s_rt value has been reached and then trigger 
>> the equivalent of a suspension ("qmod -sj").
> 
> Yes. I would call it runtime-interval inside the checkpoint definition or 
> so, to distinguish it from s/h_rt.
> 
> -- Reuti
> 
> 
>> Best,
>> Joseph
>> 
>> On 10/31/2013 04:23 PM, Reuti wrote:
>>> 
>>> Although this looks fine, I can't get it working. I mean: it works the 
>>> first time, but in the second iteration the job is killed directly even 
>>> if there is no h_rt attached at all (or set in the queue definition).
>>> 
>>> It looks like SGE is checking whether there was any warning already and if 
>>> so, directly issues a SIGKILL - on the one hand this is wrong, of course. 
>>> But it's certainly a matter for discussion: is s_rt/h_rt per iteration or 
>>> for the overall job time? (maybe: queue = per iteration, resource request = 
>>> overall time?)
>>> 
>>> I see only the option to do this outside of SGE and run `qstatus -r`*) 
>>> once in a while to get the runtime per job and take appropriate measures, 
>>> i.e. execute `qmod -sj <job_id>` as you intended.
>>> 
>>> -- Reuti
>>> 
>>> *) It's necessary to make a change to the awk script to get the raw output 
>>> instead of the formatted time in the "(relative)" case:
>>> 
>>> starttime=sprintf("%s", running_seconds)
>>> 
>> 
> 
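
The outside-of-SGE approach from the 31 Oct message (look up the runtime per 
job once in a while, then suspend what is near the limit) might be sketched as 
below. The input format is an assumption made for this example: one 
"<job_id> <running_seconds>" pair per line, which is not the actual output 
layout of `qstatus -r`. The function name jobs_over_limit and the command 
produce_runtimes in the comment are likewise hypothetical.

```shell
#!/bin/sh
# Sketch only. ASSUMED input on stdin: one "<job_id> <running_seconds>"
# pair per line. This is not the real output format of any SGE tool.

# jobs_over_limit LIMIT_SECONDS
# Prints the ids of all jobs whose runtime has reached LIMIT_SECONDS.
jobs_over_limit() {
    awk -v limit="$1" '$2 + 0 >= limit + 0 { print $1 }'
}

# Hypothetical use from cron, where produce_runtimes stands for whatever
# command emits the pairs described above:
#   produce_runtimes | jobs_over_limit 172800 | while read -r id; do
#       qmod -sj "$id"
#   done
```

Keeping the filter separate from the `qmod -sj` call means the selection can 
be checked with made-up numbers before anything is actually suspended.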
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

