Hi Reuti,
First off, here is the `qconf -sq` output. Anything obvious?
qname                 test.q
hostlist              @test_rack1 @test_rack2 @test_rack3
seq_no                0
load_thresholds       load_short=99
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             test.ckpt
pe_list               test.pe
rerun                 TRUE
slots                 1,[@test_rack1=8],[@test_rack2=8],[@test_rack3=8]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
Regards,
Joseph David Borġ
josephb.org
On 17 January 2014 15:00, Reuti <[email protected]> wrote:
> Can you please post the output of `qconf -sq <qname>` for the queue in
> question and the output of `qstat -j <job_id>` for such a job.
>
> -- Reuti
>
>
> Am 16.01.2014 um 23:45 schrieb Joe Borġ:
>
> > Only in qsub
> >
> >
> >
> > Regards,
> > Joseph David Borġ
> > josephb.org
> >
> >
> > On 16 January 2014 16:59, Reuti <[email protected]> wrote:
> > Am 16.01.2014 um 17:17 schrieb Joe Borġ:
> >
> > > I checked with qstat -j and the values displayed (in seconds) were
> > > correct.
> > >
> > > With qacct, I get
> > >
> > > failed       100 : assumedly after job
> > > exit_status  137
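> > >
> > > (For what it's worth, 137 = 128 + 9, i.e. the processes were killed with
> > > SIGKILL, which fits a hard limit being enforced rather than the jobs
> > > exiting on their own.)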
> > >
> > > Seeing as they were all killed at the exact same run time, I can't see
> > > what else could have done it.
> >
> > As mentioned, was there something in the messages file like:
> >
> > 01/16/2014 17:54:46| main|pc15370|W|job 10561.1 exceeded hard wallclock
> > time - initiate terminate method
> >
> > Is the limit in the `qsub` command or in the queue definition?
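> >
> > (I.e. roughly the difference between requesting it at submission time,
> > e.g. `qsub -l h_rt=24:00:00 job.sh`, and an `h_rt` line in the queue
> > definition as shown by `qconf -sq <qname>`.)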
> >
> > -- Reuti
> >
> >
> > > Regards,
> > > Joseph David Borġ
> > > josephb.org
> > >
> > >
> > > On 15 January 2014 20:30, Reuti <[email protected]> wrote:
> > > Hi,
> > >
> > > Am 15.01.2014 um 18:55 schrieb Joe Borġ:
> > >
> > > > I have it working, except that even if I set a job's run time to
> > > > 24 hours, they all get killed after 6 hours 40 minutes.
> > >
> > > 6h 40m = 360m + 40m = 400m = 24000s - did you perhaps forget the colons
> > > when you defined the limit?
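> > >
> > > To illustrate the time format (just a guess at what might have happened):
> > > a bare number is read as seconds, a colon-separated value as
> > > hours:minutes:seconds, so
> > >
> > >   qsub -l h_rt=24:00:00 job.sh   # 24 hours = 86400 s
> > >   qsub -l h_rt=24000 job.sh      # 24000 s = 6 h 40 m
> > >
> > > would behave very differently.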
> > >
> > >
> > > > Looking at qstat -j shows the correct number of seconds against
> > > > hard_resource_list h_rt.
> > > >
> > > > Any ideas?
> > >
> > > Was it really killed by SGE: is there any hint in the messages file of
> > > the node, i.e. something like /var/spool/sge/node01/messages, about the
> > > reason for the kill ("loglevel log_info" in `qconf -mconf`)?
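> > >
> > > A rough way to check (the spool path may differ on your installation):
> > >
> > >   grep -i "exceeded hard" /var/spool/sge/node01/messages
> > >
> > > and, if nothing is logged at all, raise the log level by setting
> > > `loglevel log_info` via `qconf -mconf`.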
> > >
> > > -- Reuti
> > >
> > >
> > > > Regards,
> > > > Joseph David Borġ
> > > > josephb.org
> > > >
> > > >
> > > > On 15 January 2014 10:24, Reuti <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > Am 15.01.2014 um 11:16 schrieb Joe Borġ:
> > > >
> > > > > Using h_rt kills the job after the allotted time.
> > > >
> > > > Yes.
> > > >
> > > >
> > > > > Can't this be disabled?
> > > >
> > > > There is no feature in SGE to extend the granted runtime of a job (I
> > > > heard such a thing is available in Torque).
> > > >
> > > >
> > > > > We only want to use it as a rough guide.
> > > >
> > > > If you want to do it only once in a while for a particular job:
> > > >
> > > > In this case you can just kill (or softstop) the `sgeexecd` on the node.
> > > > You will lose control of the jobs on that node, and of the node itself
> > > > (from SGE's view - `qhost` shows "-" for the node's load). So you have to
> > > > check from time to time whether the job in question has finished, and
> > > > then restart the `sgeexecd`. Also, no new jobs will be scheduled to the
> > > > node.
> > > >
> > > > Only at the point of restarting the `sgeexecd` will it discover that the
> > > > job finished (and send an email if applicable). Other (still) running
> > > > jobs will regain supervision of their runtime.
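> > > >
> > > > A sketch of the procedure (assuming the standard init script location;
> > > > adjust for your installation):
> > > >
> > > >   # on the node: stop the execd without touching the running jobs
> > > >   $SGE_ROOT/$SGE_CELL/common/sgeexecd softstop
> > > >   # ... later, once the job in question has finished ...
> > > >   $SGE_ROOT/$SGE_CELL/common/sgeexecd start
> > > >
> > > > While it is down, `qhost` will show "-" for that node's load.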
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > > Joseph David Borġ
> > > > > josephb.org
> > > > >
> > > > >
> > > > > On 13 January 2014 17:43, Reuti <[email protected]> wrote:
> > > > > Am 13.01.2014 um 18:33 schrieb Joe Borġ:
> > > > >
> > > > > > Thanks. Can you please tell me what I'm doing wrong?
> > > > > >
> > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > > qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
> > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > >
> > > > > Only the parallel job needs "-R y".
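> > > > >
> > > > > I.e. something like this, keeping your limits and just dropping the
> > > > > flag for the one-slot jobs:
> > > > >
> > > > >   qsub -q test.q -l h_rt=60 -pe test.pe 1 small.bash
> > > > >   qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
> > > > >   qsub -q test.q -l h_rt=60 -pe test.pe 1 small.bash
> > > > >   qsub -q test.q -l h_rt=60 -pe test.pe 1 small.bash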
> > > > >
> > > > >
> > > > > >
> > > > > > job-ID  prior    name        user      state  submit/start at      queue  slots ja-task-ID
> > > > > > ---------------------------------------------------------------------------------------------
> > > > > > 156757  0.50000  small.bash  joe.borg  qw     01/13/2014 16:45:18             1
> > > > > > 156761  0.50000  big.bash    joe.borg  qw     01/13/2014 16:55:31             2
> > > > > > 156762  0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:33             1
> > > > > > 156763  0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:34             1
> > > > > >
> > > > > > ...But when I release...
> > > > >
> > > > > max_reservation is set?
> > > > >
> > > > > But the reservation feature also has to be judged in the context of a
> > > > > running cluster. If all four jobs are on hold and released at once, I
> > > > > wouldn't be surprised if it's not strictly FIFO.
> > > > >
> > > > >
> > > > > > job-ID  prior    name        user      state  submit/start at      queue        slots ja-task-ID
> > > > > > ---------------------------------------------------------------------------------------------------
> > > > > > 156757  0.50000  small.bash  joe.borg  r      01/13/2014 16:56:06  test.q@test      1
> > > > > > 156762  0.50000  small.bash  joe.borg  r      01/13/2014 16:56:06  test.q@test      1
> > > > > > 156761  0.50000  big.bash    joe.borg  qw     01/13/2014 16:55:31                   2
> > > > > > 156763  0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:34                   1
> > > > >
> > > > > As job 156762 has the same runtime as 156757, backfilling will occur
> > > > > to use the otherwise idling core. Whether job 156762 is started or
> > > > > not, the parallel one 156761 will start at the same time. Only 156763
> > > > > shouldn't start.
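> > > > >
> > > > > A rough sketch of how I read it, assuming both slots are free at t=0
> > > > > and jobs run to their full h_rt:
> > > > >
> > > > >   t=0    156757 (h_rt=60) starts on slot 1; slot 2 is reserved for
> > > > >          156761 from t=60 on
> > > > >   t=0    156762 (h_rt=60) is guaranteed to finish by t=60, so it may
> > > > >          backfill into the otherwise idle slot 2
> > > > >   t=60   156761 (2 slots, h_rt=120) starts
> > > > >   t=180  156763 starts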
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Joseph David Borġ
> > > > > > josephb.org
> > > > > >
> > > > > >
> > > > > > On 13 January 2014 17:26, Reuti <[email protected]> wrote:
> > > > > > Am 13.01.2014 um 17:24 schrieb Joe Borġ:
> > > > > >
> > > > > > > Hi Reuti,
> > > > > > >
> > > > > > > I am using a PE, so that's fine.
> > > > > > >
> > > > > > > I've not set any of the other 3. Will the job be killed if
> > > > > > > default_duration is exceeded?
> > > > > >
> > > > > > No. It can be set to any value you like (such as a few weeks), but
> > > > > > it shouldn't be set to "INFINITY", as SGE judges infinity to be
> > > > > > smaller than infinity and so backfilling will always occur.
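> > > > > >
> > > > > > E.g. (just an illustrative value) in `qconf -msconf`:
> > > > > >
> > > > > >   default_duration    168:00:00
> > > > > >
> > > > > > i.e. one week is assumed for every job that doesn't request an h_rt
> > > > > > itself.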
> > > > > >
> > > > > > -- Reuti
> > > > > >
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Joseph David Borġ
> > > > > > > josephb.org
> > > > > > >
> > > > > > >
> > > > > > > On 13 January 2014 16:16, Reuti <[email protected]> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Am 13.01.2014 um 16:58 schrieb Joe Borġ:
> > > > > > >
> > > > > > > > I'm trying to set up an SGE queue and am having a problem
> > > > > > > > getting the jobs to start in the right order. Here is my example
> > > > > > > > - test.q with 2 possible slots and the following jobs queued:
> > > > > > > >
> > > > > > > > job-ID  prior    name        user      state  submit/start at      queue  slots ja-task-ID
> > > > > > > > --------------------------------------------------------------------------------------------
> > > > > > > > 1       0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:16             1
> > > > > > > > 2       0.50000  big.bash    joe.borg  qw     01/13/2014 15:43:24             2
> > > > > > > > 3       0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:27             1
> > > > > > > > 4       0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:28             1
> > > > > > > >
> > > > > > > > I want the jobs to run in that order, but (obviously) when I
> > > > > > > > enable the queue, the small jobs fill the available slots and the
> > > > > > > > big job has to wait for them to complete. I'd like it set up so
> > > > > > > > that only job 1 runs and finishes, then job 2 (with both slots),
> > > > > > > > then the final two jobs, 3 and 4, together.
> > > > > > > >
> > > > > > > > I've looked at -R y on submission, but it doesn't seem to work.
> > > > > > >
> > > > > > > For the reservation to work (and it's only necessary to request it
> > > > > > > for the parallel job), all jobs need suitable "h_rt" requests.
> > > > > > >
> > > > > > > - Do you request any "h_rt" for all jobs?
> > > > > > > - Do you have a "default_duration" set to a proper value in the
> > > > > > >   scheduler configuration otherwise?
> > > > > > > - Is "max_reservation" set to a value like 16?
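> > > > > > >
> > > > > > > A quick way to check the last two (both live in the scheduler
> > > > > > > configuration):
> > > > > > >
> > > > > > >   qconf -ssconf | egrep 'max_reservation|default_duration'
> > > > > > >
> > > > > > > and adjust them with `qconf -msconf` if needed; the h_rt request
> > > > > > > goes on each `qsub`, e.g. `-l h_rt=1:00:00`.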
> > > > > > >
> > > > > > > -- Reuti
> > > > > > >
> > > > > > >
> > > > > > > > Regards,
> > > > > > > > Joseph David Borġ
> > > > > > > > josephb.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users