Hi Reuti,

They're pretty mammoth cluster nodes. But anyway, the load threshold is set to 99 to stop the queue from rejecting jobs based on load (in this test queue).
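For reference - and treat this as a sketch of the usual qconf syntax rather than a copy of our exact setup - the threshold can be raised on a live queue with something like:

    # set the load_short alarm threshold on test.q to 99, so load effectively never blocks scheduling
    qconf -mattr queue load_thresholds load_short=99 test.q

or by editing the queue interactively with `qconf -mq test.q`.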
We're not currently using the checkpoint feature either.

Regards,
Joseph David Borġ
josephb.org


On 23 January 2014 10:11, Reuti <[email protected]> wrote:
> Hi,
>
> Am 23.01.2014 um 10:17 schrieb Joe Borġ:
>
> > First off, here is the `qconf -sq` output. Anything obvious?
> >
> > qname                 test.q
> > hostlist              @test_rack1 @test_rack2 \
> >                       @test_rack3
> > seq_no                0
> > load_thresholds       load_short=99
>
> What type of machine is this, with a threshold of 99?
>
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             test.ckpt
>
> Were these jobs submitted using a checkpointing interface?
> Is this queue subordinated to any other queue?
>
> -- Reuti
>
> > pe_list               test.pe
> > rerun                 TRUE
> > slots                 1,[@test_rack1=8],[@test_rack2=8], \
> >                       [@test_rack3=8]
> > tmpdir                /tmp
> > shell                 /bin/sh
> > prolog                NONE
> > epilog                NONE
> > shell_start_mode      unix_behavior
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      NONE
> > notify                00:00:60
> > owner_list            NONE
> > user_lists            NONE
> > xuser_lists           NONE
> > subordinate_list      NONE
> > complex_values        NONE
> > projects              NONE
> > xprojects             NONE
> > calendar              NONE
> > initial_state         default
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                INFINITY
> > h_vmem                INFINITY
> >
> > Regards,
> > Joseph David Borġ
> > josephb.org
> >
> > On 17 January 2014 15:00, Reuti <[email protected]> wrote:
> > Can you please post the output of `qconf -sq <qname>` for the queue in question and the output of `qstat -j <job_id>` for such a job.
> >
> > -- Reuti
> >
> > Am 16.01.2014 um 23:45 schrieb Joe Borġ:
> >
> > > Only in qsub
> > >
> > > Regards,
> > > Joseph David Borġ
> > > josephb.org
> > >
> > > On 16 January 2014 16:59, Reuti <[email protected]> wrote:
> > > Am 16.01.2014 um 17:17 schrieb Joe Borġ:
> > >
> > > > I checked with qstat -j and the values displayed (in seconds) were correct.
> > > >
> > > > With qacct, I get
> > > >
> > > > failed       100 : assumedly after job
> > > > exit_status  137
> > > >
> > > > Seeing as they were all killed at the exact same run time, I can't see what else could have done it.
> > >
> > > As mentioned, was there something in the messages file like:
> > >
> > > 01/16/2014 17:54:46|  main|pc15370|W|job 10561.1 exceeded hard wallclock time - initiate terminate method
> > >
> > > Is the limit in the `qsub` command or in the queue definition?
> > >
> > > -- Reuti
> > >
> > > > Regards,
> > > > Joseph David Borġ
> > > > josephb.org
> > > >
> > > > On 15 January 2014 20:30, Reuti <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > Am 15.01.2014 um 18:55 schrieb Joe Borġ:
> > > >
> > > > > I have it working, except even if I put the jobs' run time as 24 hours, they all get killed after 6 hours 40 mins.
> > > >
> > > > 6h 40m = 360m + 40m = 400m = 24000s - did you by accident forget the colons when you defined the limit?
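As an aside, a bare number after h_rt is read as seconds, so the two forms being contrasted would look roughly like this on the submit line (illustrative only, not taken from the actual scripts):

    qsub -q test.q -l h_rt=24000    small.bash   # 24000 seconds = 6h 40m
    qsub -q test.q -l h_rt=24:00:00 small.bash   # a true 24 hours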
> > > > > Looking at qstat -j shows the correct number of seconds against hard_resource_list h_rt.
> > > > >
> > > > > Any ideas?
> > > >
> > > > Was it really killed by SGE: is there any hint in the messages file of the node, i.e. something like /var/spool/sge/node01/messages, about the reason for the kill ("loglevel log_info" in the `qconf -mconf`)?
> > > >
> > > > -- Reuti
> > > >
> > > > > Regards,
> > > > > Joseph David Borġ
> > > > > josephb.org
> > > > >
> > > > > On 15 January 2014 10:24, Reuti <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > Am 15.01.2014 um 11:16 schrieb Joe Borġ:
> > > > >
> > > > > > Using h_rt kills the job after the allotted time.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > Can't this be disabled?
> > > > >
> > > > > There is no feature in SGE to extend the granted runtime of a job (I heard such a thing is available in Torque).
> > > > >
> > > > > > We only want to use it as a rough guide.
> > > > >
> > > > > If you want to do it only once in a while for a particular job:
> > > > >
> > > > > In this case you can just kill (or softstop) the `sgeexecd` on the node. You will lose control of the jobs on the node, and of the node itself (from SGE's view - `qhost` shows "-" for the node's load). So you have to check from time to time whether the job in question has finished already, and then restart the `sgeexecd`. Also, no new jobs will be scheduled to the node.
> > > > >
> > > > > Only at the point of restarting the `sgeexecd` will it discover that the job has finished (and send an email if applicable). Other (still) running jobs will regain supervision of their runtime.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Regards,
> > > > > > Joseph David Borġ
> > > > > > josephb.org
> > > > > >
> > > > > > On 13 January 2014 17:43, Reuti <[email protected]> wrote:
> > > > > > Am 13.01.2014 um 18:33 schrieb Joe Borġ:
> > > > > >
> > > > > > > Thanks. Can you please tell me what I'm doing wrong?
> > > > > > >
> > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > > > qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
> > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > >
> > > > > > Only the parallel job needs "-R y".
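As an aside, if I follow that correctly, the submit lines above would presumably become something like this (same scripts, reservation requested only for the 2-slot job):

    qsub -q test.q      -l h_rt=60  -pe test.pe 1 small.bash
    qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
    qsub -q test.q      -l h_rt=60  -pe test.pe 1 small.bash
    qsub -q test.q      -l h_rt=60  -pe test.pe 1 small.bash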
> > > > > > >
> > > > > > > job-ID  prior    name        user      state  submit/start at      queue  slots  ja-task-ID
> > > > > > > -----------------------------------------------------------------------------------------------------------------
> > > > > > >  156757 0.50000  small.bash  joe.borg  qw     01/13/2014 16:45:18         1
> > > > > > >  156761 0.50000  big.bash    joe.borg  qw     01/13/2014 16:55:31         2
> > > > > > >  156762 0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:33         1
> > > > > > >  156763 0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:34         1
> > > > > > >
> > > > > > > ...But when I release...
> > > > > >
> > > > > > max_reservation is set?
> > > > > >
> > > > > > But the reservation feature must also be seen in a running cluster. If all four jobs are on hold and released at once, I wouldn't be surprised if it's not strictly FIFO.
> > > > > >
> > > > > > > job-ID  prior    name        user      state  submit/start at      queue        slots  ja-task-ID
> > > > > > > -----------------------------------------------------------------------------------------------------------------
> > > > > > >  156757 0.50000  small.bash  joe.borg  r      01/13/2014 16:56:06  test.q@test  1
> > > > > > >  156762 0.50000  small.bash  joe.borg  r      01/13/2014 16:56:06  test.q@test  1
> > > > > > >  156761 0.50000  big.bash    joe.borg  qw     01/13/2014 16:55:31               2
> > > > > > >  156763 0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:34               1
> > > > > >
> > > > > > As job 156762 has the same runtime as 156757, backfilling will occur to use the otherwise idling core. Whether job 156762 is started or not, the parallel job 156761 will start at the same time. Only 156763 shouldn't start.
> > > > > >
> > > > > > -- Reuti
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Regards,
> > > > > > > Joseph David Borġ
> > > > > > > josephb.org
> > > > > > >
> > > > > > > On 13 January 2014 17:26, Reuti <[email protected]> wrote:
> > > > > > > Am 13.01.2014 um 17:24 schrieb Joe Borġ:
> > > > > > >
> > > > > > > > Hi Reuti,
> > > > > > > >
> > > > > > > > I am using a PE, so that's fine.
> > > > > > > >
> > > > > > > > I've not set either of the other 3. Will the job be killed if default_duration is exceeded?
> > > > > > >
> > > > > > > No. It can be set to any value you like (like a few weeks), but it shouldn't be set to "INFINITY", as SGE judges infinity to be smaller than infinity and so backfilling will always occur.
> > > > > > >
> > > > > > > -- Reuti
> > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Joseph David Borġ
> > > > > > > > josephb.org
> > > > > > > >
> > > > > > > > On 13 January 2014 16:16, Reuti <[email protected]> wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Am 13.01.2014 um 16:58 schrieb Joe Borġ:
> > > > > > > >
> > > > > > > > > I'm trying to set up an SGE queue and am having a problem getting the jobs to start in the right order. Here is my example - test.q with 2 possible slots and the following jobs queued:
> > > > > > > > >
> > > > > > > > > job-ID  prior    name        user      state  submit/start at      queue  slots  ja-task-ID
> > > > > > > > > -----------------------------------------------------------------------------------------------------------------
> > > > > > > > >  1 0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:16         1
> > > > > > > > >  2 0.50000  big.bash    joe.borg  qw     01/13/2014 15:43:24         2
> > > > > > > > >  3 0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:27         1
> > > > > > > > >  4 0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:28         1
> > > > > > > > >
> > > > > > > > > I want the jobs to run in that order, but (obviously), when I enable the queue, the small jobs fill the available slots and the big job has to wait for them to complete. I'd like it set up so that only job 1 runs and finishes, then 2 (with both slots), then the final 2 jobs, 3 & 4, together.
> > > > > > > > >
> > > > > > > > > I've looked at -R y on submission, but it doesn't seem to work.
> > > > > > > >
> > > > > > > > For the reservation to work (and it's only necessary to request it for the parallel job) it's necessary to have suitable "h_rt" requests for all jobs.
> > > > > > > >
> > > > > > > > - Do you request any "h_rt" for all jobs?
> > > > > > > > - Do you have a "default_duration" set to a proper value in the scheduler configuration otherwise?
> > > > > > > > - Is "max_reservation" set to a value like 16?
> > > > > > > >
> > > > > > > > -- Reuti
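For completeness, those scheduler settings can be checked with qconf (the values in the comments are only plausible examples, not our real configuration):

    qconf -ssconf | egrep 'default_duration|max_reservation'
    # default_duration   24:00:00    <- any finite value, just not INFINITY
    # max_reservation    16

and changed with `qconf -msconf`.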
> > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Joseph David Borġ
> > > > > > > > > josephb.org
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
