Hi Reuti.
Yes, after going through the logs, the subsequent restarts are messed up.
I've played with it more and there is no easy way to do this inside the job
submission script, so I will have to resort (as you indicated) to using an outside
script that runs periodically and does a "qmod -sj" on the job / job.task-id when it
is near the s_rt value.
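A rough sketch of what such an outside script could look like, run from cron or a loop. This is only an illustration under assumptions: it expects one "<job_id> <running_seconds>" pair per line on stdin (a hypothetical format, e.g. pulled out of the raw runtime output), and S_RT, MARGIN, QMOD and the function names are made up for this example.

```shell
#!/bin/sh
# Sketch: suspend jobs that are getting close to their soft runtime limit.
# Assumption: stdin has one "<job_id> <running_seconds>" pair per line;
# this input format is hypothetical and must be produced by another step.

S_RT=${S_RT:-3600}      # soft runtime limit in seconds (assumed value)
MARGIN=${MARGIN:-300}   # act this many seconds before s_rt (assumed value)
QMOD=${QMOD:-qmod}      # set QMOD="echo qmod" for a dry run

near_limit() {
    # Succeeds if the running time ($1) is within MARGIN seconds of S_RT.
    [ "$1" -ge $((S_RT - MARGIN)) ]
}

suspend_near_limit() {
    # Read job/runtime pairs and suspend every job that is near the limit.
    while read -r job_id running_seconds; do
        if near_limit "$running_seconds"; then
            $QMOD -sj "$job_id"
        fi
    done
}
```

With QMOD="echo qmod" the script only prints the qmod calls it would make, which is a safe way to check the threshold before letting it touch real jobs.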
It seems to me that Grid Engine is missing an option in the checkpoint environment to
handle the case where the s_rt value has been reached and then trigger the equivalent
of a suspension ("qmod -sj").
Best,
Joseph
On 10/31/2013 04:23 PM, Reuti wrote:
Although this looks fine, I can't get it working. I mean: it's working for the
first time, but in the second iteration the job is killed directly even if
there is no h_rt attached at all (or set in the queue definition).
It looks like SGE is checking whether there was any warning already and if so,
issues directly a SIGKILL - this is on the one hand wrong of course. But it's
for sure a matter of discussion: is s_rt/h_rt per iteration or for the overall
job time? (maybe: queue = per iteration, resource request = overall time?)
I see only the option to do this outside of SGE: issue `qstatus -r`*) once in a
while to get the runtime per job and take appropriate measures, i.e. execute
`qmod -sj <job_id>` as you intended.
-- Reuti
*) It's necessary to make a change to the awk script to get the raw output instead of
the formatted time in the "(relative)" case:
starttime=sprintf("%s", running_seconds)