On 27.01.2014 at 11:26, Joe Borġ wrote:

> They're pretty mammoth cluster nodes. But anyway, the load is set to 99 to
> stop it from rejecting jobs based on load (in this test queue).
If you don't want load_thresholds, it can just be set to NONE.

> We're not currently using the checkpoint feature either.

Can you please submit a job with the two statements:

ulimit -aH
ulimit -aS

-- Reuti

> Regards,
> Joseph David Borġ
> josephb.org
>
>
> On 23 January 2014 10:11, Reuti <[email protected]> wrote:
> Hi,
>
> On 23.01.2014 at 10:17, Joe Borġ wrote:
>
> > First off, here is the qconf -sq output. Anything obvious?
> >
> > qname                 test.q
> > hostlist              @test_rack1 @test_rack2 \
> >                       @test_rack3
> > seq_no                0
> > load_thresholds       load_short=99
>
> What type of machine is this, with a threshold of 99?
>
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             test.ckpt
>
> Were these jobs submitted using a checkpointing interface?
> Is this queue subordinated to any other queue?
>
> -- Reuti
>
> > pe_list               test.pe
> > rerun                 TRUE
> > slots                 1,[@test_rack1=8],[@test_rack2=8], \
> >                       [@test_rack3=8]
> > tmpdir                /tmp
> > shell                 /bin/sh
> > prolog                NONE
> > epilog                NONE
> > shell_start_mode      unix_behavior
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      NONE
> > notify                00:00:60
> > owner_list            NONE
> > user_lists            NONE
> > xuser_lists           NONE
> > subordinate_list      NONE
> > complex_values        NONE
> > projects              NONE
> > xprojects             NONE
> > calendar              NONE
> > initial_state         default
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                INFINITY
> > h_vmem                INFINITY
> >
> > Regards,
> > Joseph David Borġ
> > josephb.org
> >
> >
> > On 17 January 2014 15:00, Reuti <[email protected]> wrote:
> > Can you please post the output of `qconf -sq <qname>` for the queue in
> > question and the output of `qstat -j <job_id>` for such a job.
> >
> > -- Reuti
> >
> >
> > On 16.01.2014 at 23:45, Joe Borġ wrote:
> >
> > > Only in qsub
> > >
> > > Regards,
> > > Joseph David Borġ
> > > josephb.org
> > >
> > >
> > > On 16 January 2014 16:59, Reuti <[email protected]> wrote:
> > > On 16.01.2014 at 17:17, Joe Borġ wrote:
> > >
> > > > I checked with qstat -j and the value displayed (in seconds) was
> > > > correct.
> > > >
> > > > With qacct, I get:
> > > >
> > > > failed       100 : assumedly after job
> > > > exit_status  137
> > > >
> > > > Seeing as they were all killed at the exact same run time, I can't see
> > > > what else could have done it.
> > >
> > > As mentioned, was there something in the messages file like:
> > >
> > > 01/16/2014 17:54:46|  main|pc15370|W|job 10561.1 exceeded hard wallclock
> > > time - initiate terminate method
> > >
> > > Is the limit in the `qsub` command or in the queue definition?
> > >
> > > -- Reuti
> > >
> > > > Regards,
> > > > Joseph David Borġ
> > > > josephb.org
> > > >
> > > >
> > > > On 15 January 2014 20:30, Reuti <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > On 15.01.2014 at 18:55, Joe Borġ wrote:
> > > >
> > > > > I have it working, except even if I put the jobs' run time as 24 hours,
> > > > > they all get killed after 6 hours 40 mins.
> > > >
> > > > 6h 40m = 360m + 40m = 400m = 24000s. Did you by accident forget the
> > > > colons when you defined the limit?
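As an aside on the time format: "h_rt" accepts either plain seconds or
hours:minutes:seconds, so the two requests below differ by a factor of 3.6
even though both contain a "24" (the script name is only a placeholder):

   qsub -l h_rt=24000 job.sh        # 24000 seconds = 6 h 40 min
   qsub -l h_rt=24:00:00 job.sh     # 24 hours = 86400 seconds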
> > > > > Looking at qstat -j shows the correct number of seconds against
> > > > > hard_resource_list h_rt.
> > > > >
> > > > > Any ideas?
> > > >
> > > > Was it really killed by SGE? Is there any hint in the messages file of
> > > > the node, i.e. something like /var/spool/sge/node01/messages, about the
> > > > reason for the kill ("loglevel log_info" in the `qconf -mconf`)?
> > > >
> > > > -- Reuti
> > > >
> > > > > Regards,
> > > > > Joseph David Borġ
> > > > > josephb.org
> > > > >
> > > > >
> > > > > On 15 January 2014 10:24, Reuti <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > On 15.01.2014 at 11:16, Joe Borġ wrote:
> > > > >
> > > > > > Using h_rt kills the job after the allotted time.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > Can't this be disabled?
> > > > >
> > > > > There is no feature in SGE to extend the granted runtime of a job (I
> > > > > heard such a thing is available in Torque).
> > > > >
> > > > > > We only want to use it as a rough guide.
> > > > >
> > > > > If you want to do it only once in a while for a particular job:
> > > > >
> > > > > In this case you can just kill (or softstop) the `sgeexecd` on the node.
> > > > > You will lose control of the jobs on the node and of the node itself
> > > > > (from SGE's view - `qhost` shows "-" for the node's load). So you have
> > > > > to check from time to time whether the job in question has finished
> > > > > already, and then restart the `sgeexecd`. Also, no new jobs will be
> > > > > scheduled to the node.
> > > > >
> > > > > Only at the point of restarting the `sgeexecd` will it discover that
> > > > > the job finished (and send an email if applicable). Other (still)
> > > > > running jobs will regain supervision of their runtime.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > > > Regards,
> > > > > > Joseph David Borġ
> > > > > > josephb.org
> > > > > >
> > > > > >
> > > > > > On 13 January 2014 17:43, Reuti <[email protected]> wrote:
> > > > > > On 13.01.2014 at 18:33, Joe Borġ wrote:
> > > > > >
> > > > > > > Thanks. Can you please tell me what I'm doing wrong?
> > > > > > >
> > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > > > qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
> > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
> > > > > >
> > > > > > Only the parallel job needs "-R y".
> > > > > >
> > > > > > > job-ID  prior    name        user      state  submit/start at      queue  slots  ja-task-ID
> > > > > > > -------------------------------------------------------------------------------------------
> > > > > > > 156757  0.50000  small.bash  joe.borg  qw     01/13/2014 16:45:18         1
> > > > > > > 156761  0.50000  big.bash    joe.borg  qw     01/13/2014 16:55:31         2
> > > > > > > 156762  0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:33         1
> > > > > > > 156763  0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:34         1
> > > > > > >
> > > > > > > ...But when I release...
> > > > > >
> > > > > > max_reservation is set?
> > > > > >
> > > > > > But the reservation feature must also be seen in the context of a
> > > > > > running cluster. If all four jobs are on hold and released at once,
> > > > > > I wouldn't be surprised if it's not strictly FIFO.
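Side note on checking what the scheduler actually reserves: one way (a rough
sketch, and the path below assumes the default cell) is to look at
max_reservation in the scheduler configuration and to switch on scheduler
monitoring:

   # show max_reservation and default_duration
   qconf -ssconf

   # add MONITOR=1 to the "params" line via `qconf -msconf`, then the
   # reservation decisions are written to the schedule file:
   tail -f $SGE_ROOT/default/common/schedule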
> > > > > > > job-ID  prior    name        user      state  submit/start at      queue        slots  ja-task-ID
> > > > > > > -------------------------------------------------------------------------------------------------
> > > > > > > 156757  0.50000  small.bash  joe.borg  r      01/13/2014 16:56:06  test.q@test  1
> > > > > > > 156762  0.50000  small.bash  joe.borg  r      01/13/2014 16:56:06  test.q@test  1
> > > > > > > 156761  0.50000  big.bash    joe.borg  qw     01/13/2014 16:55:31               2
> > > > > > > 156763  0.50000  small.bash  joe.borg  qw     01/13/2014 16:55:34               1
> > > > > >
> > > > > > As job 156762 has the same runtime as 156757, backfilling will occur
> > > > > > to use the otherwise idling core. Whether job 156762 is started or
> > > > > > not, the parallel one 156761 will start at the same time. Only 156763
> > > > > > shouldn't start.
> > > > > >
> > > > > > -- Reuti
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Joseph David Borġ
> > > > > > > josephb.org
> > > > > > >
> > > > > > >
> > > > > > > On 13 January 2014 17:26, Reuti <[email protected]> wrote:
> > > > > > > On 13.01.2014 at 17:24, Joe Borġ wrote:
> > > > > > >
> > > > > > > > Hi Reuti,
> > > > > > > >
> > > > > > > > I am using a PE, so that's fine.
> > > > > > > >
> > > > > > > > I've not set either of the other 3. Will the job be killed if
> > > > > > > > default_duration is exceeded?
> > > > > > >
> > > > > > > No. It can be set to any value you like (like a few weeks), but it
> > > > > > > shouldn't be set to "INFINITY", as SGE judges infinity to be smaller
> > > > > > > than infinity and so backfilling will always occur.
> > > > > > >
> > > > > > > -- Reuti
> > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Joseph David Borġ
> > > > > > > > josephb.org
> > > > > > > >
> > > > > > > >
> > > > > > > > On 13 January 2014 16:16, Reuti <[email protected]> wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On 13.01.2014 at 16:58, Joe Borġ wrote:
> > > > > > > >
> > > > > > > > > I'm trying to set up an SGE queue and am having a problem getting
> > > > > > > > > the jobs to start in the right order. Here is my example - test.q
> > > > > > > > > with 2 possible slots and the following jobs queued:
> > > > > > > > >
> > > > > > > > > job-ID  prior    name        user      state  submit/start at      queue  slots  ja-task-ID
> > > > > > > > > ---------------------------------------------------------------------------------------------
> > > > > > > > > 1       0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:16         1
> > > > > > > > > 2       0.50000  big.bash    joe.borg  qw     01/13/2014 15:43:24         2
> > > > > > > > > 3       0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:27         1
> > > > > > > > > 4       0.50000  small.bash  joe.borg  qw     01/13/2014 15:43:28         1
> > > > > > > > >
> > > > > > > > > I want the jobs to run in that order, but (obviously) when I enable
> > > > > > > > > the queue, the small jobs fill the available slots and the big job
> > > > > > > > > has to wait for them to complete. I'd like it set up so that only
> > > > > > > > > job 1 runs and finishes, then 2 (with both slots), then the final
> > > > > > > > > two jobs, 3 & 4, together.
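Side note: instead of disabling and enabling the whole queue for such an
ordering test, the jobs could also be submitted with a user hold and released
together (a sketch only; the job IDs are just examples):

   qsub -h -q test.q -l h_rt=60 -pe test.pe 1 small.bash   # waits in state "hqw"
   # submit the remaining jobs with -h as well, then:
   qrls 1 2 3 4                                            # release all held jobs at once

   # or disable/enable the queue itself:
   qmod -d test.q
   qmod -e test.q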
> > > > > > > > > I've looked at -R y on submission, but it doesn't seem to work.
> > > > > > > >
> > > > > > > > For the reservation to work (and it's only necessary to request it
> > > > > > > > for the parallel job) it's necessary to have suitable "h_rt" requests
> > > > > > > > for all jobs.
> > > > > > > >
> > > > > > > > - Do you request an "h_rt" for all jobs?
> > > > > > > > - Otherwise, do you have "default_duration" set to a proper value in
> > > > > > > >   the scheduler configuration?
> > > > > > > > - Is "max_reservation" set to a value like 16?
> > > > > > > >
> > > > > > > > -- Reuti
> > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Joseph David Borġ
> > > > > > > > > josephb.org
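To put the reservation checklist from above together in one place, a rough
sketch (the values are only examples, not recommendations):

   # scheduler configuration (edit with `qconf -msconf`):
   #   max_reservation     16         # allow reservations at all
   #   default_duration    0:10:0     # finite fallback runtime, not INFINITY

   # submission: every job states a runtime, only the parallel job
   # asks for a reservation
   qsub -q test.q -l h_rt=0:01:00 -pe test.pe 1 small.bash
   qsub -q test.q -l h_rt=0:02:00 -pe test.pe 2 -R y big.bash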
