Re: [gridengine users] Queue limit s_rt / h_rt and CheckPoint

Joseph Farran Fri, 01 Nov 2013 11:20:54 -0700

Hi Reuti.

Yes, after going through the logs, the subsequent restarts are messed up.


I've played with it more and there is no easy way to do this inside the job 
submission script, so I will have to resort (as you indicated) to using an outside 
script that runs periodically and does a "qmod -sj job" / "qmod -sj job.task-id" 
when near the s_rt value.

It seems to me that Grid Engine is missing an option in the checkpoint environment to 
handle the case where the s_rt value has been reached and then trigger the equivalent 
of a suspension ("qmod -sj").

Best,
Joseph

On 10/31/2013 04:23 PM, Reuti wrote:


Although this looks fine, I can't get it working. I mean: it works the 
first time, but in the second iteration the job is killed directly even if 
there is no h_rt attached at all (or set in the queue definition).

It looks like SGE checks whether a warning was already issued and, if so, 
sends a SIGKILL directly. On the one hand this is wrong, of course, but it is 
certainly a matter for discussion: is s_rt/h_rt per iteration or for the overall 
job time? (maybe: queue = per iteration, resource request = overall time?)

I see only the option to do this outside of SGE: issue `qstatus -r`*) once in a 
while to get the runtime per job and take appropriate measures, i.e. execute `qmod 
-sj <job_id>` as you intended.
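
A minimal sketch of such an outside watchdog, in Python, might look like the 
following. It assumes the per-job runtimes (in seconds) have already been 
collected, e.g. by parsing the raw output of `qstatus -r`; that parsing is left 
out here. The function names, the `margin` parameter and the dict layout are 
hypothetical and only for illustration; the only real command used is 
`qmod -sj`.

```python
import subprocess


def jobs_near_limit(runtimes, s_rt, margin=300):
    """Return the job ids whose runtime is within `margin` seconds of s_rt.

    `runtimes` maps a job id (or "job.task-id") to its runtime in seconds.
    How this mapping is obtained is an assumption left to the caller.
    """
    return [job for job, secs in runtimes.items() if secs >= s_rt - margin]


def suspend_command(job_id):
    """Build the argument list for suspending one job or array task."""
    return ["qmod", "-sj", str(job_id)]


def suspend_jobs(job_ids, dry_run=True):
    """Suspend the given jobs; with dry_run=True only return the commands."""
    commands = [suspend_command(job_id) for job_id in job_ids]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if qmod reports a failure
            subprocess.run(command, check=True)
    return commands
```

Run from cron or a loop, this would be called as 
`suspend_jobs(jobs_near_limit(runtimes, s_rt=3600), dry_run=False)`, where the 
`s_rt` value has to match the limit configured for the queue.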

-- Reuti

*) It's necessary to change the awk script to get the raw output instead of the 
formatted time in the "(relative)" case:

starttime=sprintf("%s", running_seconds)


