The same sleep job ran for 12 hours without any issue in the PE either. I can't find any CPU / MEM / IO restrictions set anywhere, though, so I still can't tell what's killing the real jobs.
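Something like the two commands below should show whether the killed jobs are hitting a CPU-time limit rather than a wallclock one, as Reuti suspected earlier; the job ID and the spool path are only placeholders, not taken from an actual killed job:

# compare recorded wallclock vs. CPU time for one of the killed jobs (job ID is an example)
qacct -j 164678 | egrep 'ru_wallclock|^cpu|failed|exit_status'

# and look for the kill reason in the node's execd messages file
# (the spool path is a guess - adjust to the local cell directory)
grep 'job 164678' /var/spool/sge/*/messages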
Regards,
Joseph David Borġ
josephb.org

On 31 January 2014 08:53, Joe Borġ <[email protected]> wrote:

> OK, I'm now testing the same job, within a parallel environment, to see if it has an effect.
>
> I can't see anywhere that I set CPU limits.
>
> Regards,
> Joseph David Borġ
> josephb.org
>
> On 30 January 2014 11:22, Reuti <[email protected]> wrote:
>
>> Am 30.01.2014 um 11:33 schrieb Joe Borġ:
>>
>> > Did you want the output? I submitted them in a job, without a PE, on a different queue and it lasted the time it was meant to:
>> >
>> > $ cat test.bash.o164678
>> > core file size (blocks, -c) unlimited
>> > data seg size (kbytes, -d) unlimited
>> > scheduling priority (-e) 0
>> > file size (blocks, -f) unlimited
>> > pending signals (-i) 256644
>> > max locked memory (kbytes, -l) unlimited
>> > max memory size (kbytes, -m) unlimited
>> > open files (-n) 4096
>> > pipe size (512 bytes, -p) 8
>> > POSIX message queues (bytes, -q) 819200
>> > real-time priority (-r) 0
>> > stack size (kbytes, -s) unlimited
>> > cpu time (seconds, -t) unlimited
>>
>> Yes, I thought about a cpu time limit by the kernel - which is not the case.
>>
>> Anyway: do you observe a difference in used CPU time and used wallclock time as a trigger when a job is killed? As this job ran to completion, it might be a CPU time limit somewhere.
>>
>> -- Reuti
>>
>> > max user processes (-u) 256644
>> > virtual memory (kbytes, -v) unlimited
>> > file locks (-x) unlimited
>> > core file size (blocks, -c) unlimited
>> > data seg size (kbytes, -d) unlimited
>> > scheduling priority (-e) 0
>> > file size (blocks, -f) unlimited
>> > pending signals (-i) 256644
>> > max locked memory (kbytes, -l) unlimited
>> > max memory size (kbytes, -m) unlimited
>> > open files (-n) 1024
>> > pipe size (512 bytes, -p) 8
>> > POSIX message queues (bytes, -q) 819200
>> > real-time priority (-r) 0
>> > stack size (kbytes, -s) unlimited
>> > cpu time (seconds, -t) unlimited
>> > max user processes (-u) 256644
>> > virtual memory (kbytes, -v) unlimited
>> > file locks (-x) unlimited
>> > Sleeping for 12 hours zzZZZ
>> > Waking up
>> >
>> > Regards,
>> > Joseph David Borġ
>> > josephb.org
>> >
>> > On 27 January 2014 14:22, Reuti <[email protected]> wrote:
>> > Am 27.01.2014 um 11:26 schrieb Joe Borġ:
>> >
>> > > They're pretty mammoth cluster nodes. But anyway, the load is set to 99 to stop it from rejecting jobs based on load (in this test queue).
>> >
>> > If you don't want load_thresholds, it can just be set to NONE.
>> >
>> > > We're not currently using the checkpoint feature either.
>> >
>> > Can you please submit a job with the two statements:
>> >
>> > ulimit -aH
>> > ulimit -aS
>> >
>> > -- Reuti
>> >
>> > > Regards,
>> > > Joseph David Borġ
>> > > josephb.org
>> > >
>> > > On 23 January 2014 10:11, Reuti <[email protected]> wrote:
>> > > Hi,
>> > >
>> > > Am 23.01.2014 um 10:17 schrieb Joe Borġ:
>> > >
>> > > > First off, here is the qconf -sq. Anything obvious?
>> > > >
>> > > > qname test.q
>> > > > hostlist @test_rack1 @test_rack2 \
>> > > >          @test_rack3
>> > > > seq_no 0
>> > > > load_thresholds load_short=99
>> > >
>> > > What type of machine is this, with a threshold of 99?
>> > >
>> > > > suspend_thresholds NONE
>> > > > nsuspend 1
>> > > > suspend_interval 00:05:00
>> > > > priority 0
>> > > > min_cpu_interval 00:05:00
>> > > > processors UNDEFINED
>> > > > qtype BATCH INTERACTIVE
>> > > > ckpt_list test.ckpt
>> > >
>> > > Were these jobs submitted using a checkpointing interface?
>> > > Is this queue subordinated to any other queue?
>> > >
>> > > --Reuti
>> > >
>> > > > pe_list test.pe
>> > > > rerun TRUE
>> > > > slots 1,[@test_rack1=8],[@test_rack2=8], \
>> > > >       [@test_rack3=8]
>> > > > tmpdir /tmp
>> > > > shell /bin/sh
>> > > > prolog NONE
>> > > > epilog NONE
>> > > > shell_start_mode unix_behavior
>> > > > starter_method NONE
>> > > > suspend_method NONE
>> > > > resume_method NONE
>> > > > terminate_method NONE
>> > > > notify 00:00:60
>> > > > owner_list NONE
>> > > > user_lists NONE
>> > > > xuser_lists NONE
>> > > > subordinate_list NONE
>> > > > complex_values NONE
>> > > > projects NONE
>> > > > xprojects NONE
>> > > > calendar NONE
>> > > > initial_state default
>> > > > s_rt INFINITY
>> > > > h_rt INFINITY
>> > > > s_cpu INFINITY
>> > > > h_cpu INFINITY
>> > > > s_fsize INFINITY
>> > > > h_fsize INFINITY
>> > > > s_data INFINITY
>> > > > h_data INFINITY
>> > > > s_stack INFINITY
>> > > > h_stack INFINITY
>> > > > s_core INFINITY
>> > > > h_core INFINITY
>> > > > s_rss INFINITY
>> > > > h_rss INFINITY
>> > > > s_vmem INFINITY
>> > > > h_vmem INFINITY
>> > > >
>> > > > Regards,
>> > > > Joseph David Borġ
>> > > > josephb.org
>> > > >
>> > > > On 17 January 2014 15:00, Reuti <[email protected]> wrote:
>> > > > Can you please post the output of `qconf -sq <qname>` for the queue in question and the output of `qstat -j <job_id>` for such a job.
>> > > >
>> > > > -- Reuti
>> > > >
>> > > > Am 16.01.2014 um 23:45 schrieb Joe Borġ:
>> > > >
>> > > > > Only in qsub
>> > > > >
>> > > > > Regards,
>> > > > > Joseph David Borġ
>> > > > > josephb.org
>> > > > >
>> > > > > On 16 January 2014 16:59, Reuti <[email protected]> wrote:
>> > > > > Am 16.01.2014 um 17:17 schrieb Joe Borġ:
>> > > > >
>> > > > > > I checked with qstat -j and the values displayed (in seconds) were correct.
>> > > > > >
>> > > > > > With qacct, I get
>> > > > > >
>> > > > > > failed 100 : assumedly after job
>> > > > > > exit_status 137
>> > > > > >
>> > > > > > Seeing as they were all killed at the exact same run time, I can't see what else could have done it.
>> > > > >
>> > > > > As mentioned, was there something in the messages file like:
>> > > > >
>> > > > > 01/16/2014 17:54:46| main|pc15370|W|job 10561.1 exceeded hard wallclock time - initiate terminate method
>> > > > >
>> > > > > Is the limit in the `qsub` command or in the queue definition?
>> > > > >
>> > > > > -- Reuti
>> > > > >
>> > > > > > Regards,
>> > > > > > Joseph David Borġ
>> > > > > > josephb.org
>> > > > > >
>> > > > > > On 15 January 2014 20:30, Reuti <[email protected]> wrote:
>> > > > > > Hi,
>> > > > > >
>> > > > > > Am 15.01.2014 um 18:55 schrieb Joe Borġ:
>> > > > > >
>> > > > > > > I have it working, except even if I put the jobs' run time as 24 hours, they all get killed after 6 hours 40 mins.
>> > > > > >
>> > > > > > 6h 40m = 360m + 40m = 400m = 24000s - did you forget by accident the colons when you defined the limit?
>> > > > > >
>> > > > > > > Looking at qstat -j shows the correct number of seconds against hard_resource_list h_rt.
>> > > > > > >
>> > > > > > > Any ideas?
>> > > > > >
>> > > > > > Was it really killed by SGE: is there any hint in the messages file of the node, i.e. something like /var/spool/sge/node01/messages, about the reason for the kill ("loglevel log_info" in the `qconf -mconf`)?
>> > > > > >
>> > > > > > -- Reuti
>> > > > > >
>> > > > > > > Regards,
>> > > > > > > Joseph David Borġ
>> > > > > > > josephb.org
>> > > > > > >
>> > > > > > > On 15 January 2014 10:24, Reuti <[email protected]> wrote:
>> > > > > > > Hi,
>> > > > > > >
>> > > > > > > Am 15.01.2014 um 11:16 schrieb Joe Borġ:
>> > > > > > >
>> > > > > > > > Using h_rt kills the job after the allotted time.
>> > > > > > >
>> > > > > > > Yes.
>> > > > > > >
>> > > > > > > > Can't this be disabled?
>> > > > > > >
>> > > > > > > There is no feature in SGE to extend the granted runtime of a job (I heard such a thing is available in Torque).
>> > > > > > >
>> > > > > > > > We only want to use it as a rough guide.
>> > > > > > >
>> > > > > > > If you want to do it only once in a while for a particular job:
>> > > > > > >
>> > > > > > > In this case you can just kill (or softstop) the `sgeexecd` on the node. You will lose control of the jobs on the node and the node itself (from SGE's view - `qhost` shows "-" for the node's load). So you have to check from time to time whether the job in question has finished already, and then restart the `sgeexecd`. Also, no new jobs will be scheduled to the node.
>> > > > > > >
>> > > > > > > Only at the point of restarting the `sgeexecd` will it discover that the job finished (and send an email if applicable). Other (still) running jobs will regain supervision of their runtime.
>> > > > > > >
>> > > > > > > -- Reuti
>> > > > > > >
>> > > > > > > > Thanks
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > > Joseph David Borġ
>> > > > > > > > josephb.org
>> > > > > > > >
>> > > > > > > > On 13 January 2014 17:43, Reuti <[email protected]> wrote:
>> > > > > > > > Am 13.01.2014 um 18:33 schrieb Joe Borġ:
>> > > > > > > >
>> > > > > > > > > Thanks. Can you please tell me what I'm doing wrong?
>> > > > > > > > >
>> > > > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
>> > > > > > > > > qsub -q test.q -R y -l h_rt=120 -pe test.pe 2 big.bash
>> > > > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
>> > > > > > > > > qsub -q test.q -R y -l h_rt=60 -pe test.pe 1 small.bash
>> > > > > > > >
>> > > > > > > > Only the parallel job needs "-R y".
>> > > > > > > >
>> > > > > > > > > job-ID prior name user state submit/start at queue slots ja-task-ID
>> > > > > > > > > -----------------------------------------------------------------------------------------------------------------
>> > > > > > > > > 156757 0.50000 small.bash joe.borg qw 01/13/2014 16:45:18 1
>> > > > > > > > > 156761 0.50000 big.bash joe.borg qw 01/13/2014 16:55:31 2
>> > > > > > > > > 156762 0.50000 small.bash joe.borg qw 01/13/2014 16:55:33 1
>> > > > > > > > > 156763 0.50000 small.bash joe.borg qw 01/13/2014 16:55:34 1
>> > > > > > > > >
>> > > > > > > > > ...But when I release...
>> > > > > > > >
>> > > > > > > > max_reservation is set?
>> > > > > > > >
>> > > > > > > > But the reservation feature must also be seen in a running cluster. If all four jobs are on hold and released at once, I wouldn't be surprised if it's not strictly FIFO.
>> > > > > > > >
>> > > > > > > > > job-ID prior name user state submit/start at queue slots ja-task-ID
>> > > > > > > > > -----------------------------------------------------------------------------------------------------------------
>> > > > > > > > > 156757 0.50000 small.bash joe.borg r 01/13/2014 16:56:06 test.q@test 1
>> > > > > > > > > 156762 0.50000 small.bash joe.borg r 01/13/2014 16:56:06 test.q@test 1
>> > > > > > > > > 156761 0.50000 big.bash joe.borg qw 01/13/2014 16:55:31 2
>> > > > > > > > > 156763 0.50000 small.bash joe.borg qw 01/13/2014 16:55:34 1
>> > > > > > > >
>> > > > > > > > As job 156762 has the same runtime as 156757, backfilling will occur to use the otherwise idling core. Whether job 156762 is started or not, the parallel one 156761 will start at the same time. Only 156763 shouldn't start.
>> > > > > > > >
>> > > > > > > > -- Reuti
>> > > > > > > >
>> > > > > > > > > Thanks
>> > > > > > > > >
>> > > > > > > > > Regards,
>> > > > > > > > > Joseph David Borġ
>> > > > > > > > > josephb.org
>> > > > > > > > >
>> > > > > > > > > On 13 January 2014 17:26, Reuti <[email protected]> wrote:
>> > > > > > > > > Am 13.01.2014 um 17:24 schrieb Joe Borġ:
>> > > > > > > > >
>> > > > > > > > > > Hi Reuti,
>> > > > > > > > > >
>> > > > > > > > > > I am using a PE, so that's fine.
>> > > > > > > > > >
>> > > > > > > > > > I've not set any of the other 3. Will the job be killed if default_duration is exceeded?
>> > > > > > > > >
>> > > > > > > > > No. It can be set to any value you like (like a few weeks), but it shouldn't be set to "INFINITY" as SGE judges infinity being smaller than infinity and so backfilling will always occur.
>> > > > > > > > >
>> > > > > > > > > -- Reuti
>> > > > > > > > >
>> > > > > > > > > > Thanks
>> > > > > > > > > >
>> > > > > > > > > > Regards,
>> > > > > > > > > > Joseph David Borġ
>> > > > > > > > > > josephb.org
>> > > > > > > > > >
>> > > > > > > > > > On 13 January 2014 16:16, Reuti <[email protected]> wrote:
>> > > > > > > > > > Hi,
>> > > > > > > > > >
>> > > > > > > > > > Am 13.01.2014 um 16:58 schrieb Joe Borġ:
>> > > > > > > > > >
>> > > > > > > > > > > I'm trying to set up an SGE queue and am having a problem getting the jobs to start in the right order. Here is my example - test.q with 2 possible slots and the following jobs queued:
>> > > > > > > > > > >
>> > > > > > > > > > > job-ID prior name user state submit/start at queue slots ja-task-ID
>> > > > > > > > > > > -----------------------------------------------------------------------------------------------------------------
>> > > > > > > > > > > 1 0.50000 small.bash joe.borg qw 01/13/2014 15:43:16 1
>> > > > > > > > > > > 2 0.50000 big.bash joe.borg qw 01/13/2014 15:43:24 2
>> > > > > > > > > > > 3 0.50000 small.bash joe.borg qw 01/13/2014 15:43:27 1
>> > > > > > > > > > > 4 0.50000 small.bash joe.borg qw 01/13/2014 15:43:28 1
>> > > > > > > > > > >
>> > > > > > > > > > > I want the jobs to run in that order, but (obviously) when I enable the queue, the small jobs fill the available slots and the big job has to wait for them to complete. I'd like it set up so that only job 1 runs and finishes, then 2 (with both slots), then the final 2 jobs, 3 & 4, together.
>> > > > > > > > > > >
>> > > > > > > > > > > I've looked at -R y on submission, but it doesn't seem to work.
>> > > > > > > > > >
>> > > > > > > > > > For the reservation to work (and it's only necessary to request it for the parallel job) it's necessary to have suitable "h_rt" requests for all jobs.
>> > > > > > > > > >
>> > > > > > > > > > - Do you request any "h_rt" for all jobs?
>> > > > > > > > > > - Do you have a "default_duration" set to a proper value in the scheduler configuration otherwise?
>> > > > > > > > > > - Is "max_reservation" set to a value like 16?
>> > > > > > > > > >
>> > > > > > > > > > -- Reuti
>> > > > > > > > > >
>> > > > > > > > > > > Regards,
>> > > > > > > > > > > Joseph David Borġ
>> > > > > > > > > > > josephb.org
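P.S. For the archive, the reservation setup discussed above boils down to roughly the following. The scheduler values are only examples (not what we actually run), and the queue, PE and script names are the test ones from the thread:

# scheduler configuration (qconf -msconf): allow reservations and use a finite default_duration
#   max_reservation    16
#   default_duration   168:00:00

# request h_rt with colons (HH:MM:SS); a bare number is read as seconds,
# e.g. 24000 is 6 h 40 min, not 24 hours
qsub -q test.q -l h_rt=24:00:00 small.bash
qsub -q test.q -R y -l h_rt=24:00:00 -pe test.pe 2 big.bash   # only the parallel job needs -R y
qsub -q test.q -l h_rt=24:00:00 small.bash
qsub -q test.q -l h_rt=24:00:00 small.bash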
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
