Hi,

On 20.04.2018 at 22:55, Ilya M wrote:
> Hi Reuti,
>
> There are dozens of hosts in @gpu. In my test submissions, however, I am
> using only one host that I specify with '-l hostname='. I disabled all other
> queues on this host to make sure nothing else but my test jobs are running
> there.
>
> BTW, after several hours, my PE 1 job went through. My submissions to the
> regular queue worked fine.
>
> Update: As I was writing this response, I tried one change in the queue
> configuration: I created a new host group with only one node in it and
> changed my test queue to run only on that hostgroup. I submitted a couple of
> PE jobs with allocation rules '1', '2', '4', and did not request a specific
> hostname this time. The jobs started running immediately. And the old jobs
> that had been waiting also went through.
>
> After discovering that, I tested the normal production queue, combining
> '-l hostname=' and '-pe'. These jobs did not run, and 'qalter -w v' reported
> "cannot run because it exceeds limit "ilya/////" in rule
> "limit_slots_for_users/1"".
>
> So in my cluster, there seems to be some issue with the combination of RQS,
> a PE, and '-l hostname=' that makes jobs unschedulable. I wonder if anyone
> else can reproduce this behavior, to see whether this is an SGE bug or some
> problem in my configuration.

Yes to two issues:

- there is a problem with load values and RQS used at the same time
- there is a problem with "-pe" and "-l hostname=", but this can be
  circumvented by -q "*@myhost" instead of -l (at least when I used it
  last time)

-- Reuti

> Ilya.
>
>
> On Fri, Apr 20, 2018 at 12:34 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 20.04.2018 at 21:04, Ilya M wrote:
>
> > Hello,
> >
> > I set up a test queue to test new prolog/epilog scripts, and I am seeing
> > some strange behavior when I submit a PE job to this queue, which causes
> > the job to not get scheduled forever or for a very long period of time.
> > I tried several PEs with allocation rules of '1', '2', '4'. All to no
> > avail.
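[Editor's note: the '-q' workaround Reuti describes can be sketched as
below. Queue, PE, and host names are placeholders for illustration, not
taken from the thread; these commands require a running Grid Engine
cluster.]

```shell
# Problematic form: PE request combined with a hostname resource request,
# which the thread reports can leave the job unschedulable under an RQS:
qsub -pe pe_1 4 -l hostname=node042 job.sh

# Workaround: pin the job to the host via the queue-instance syntax
# instead of -l hostname= (any queue on that host):
qsub -pe pe_1 4 -q "*@node042" job.sh

# Or restrict it to one particular queue's instance on that host:
qsub -pe pe_1 4 -q "test_gpu.q@node042" job.sh
```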
> > Submitting a job without a PE makes it run immediately. I am using
> > SGE 6.2u5.
> >
> > Checking why it is not running:
> > $ qalter -w v 7301747
> > ...
> > Job 7301747 cannot run because it exceeds limit "ilya/////" in rule
> > "limit_slots_for_users/1"
> > Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots

This error message is often misleading, although there is a real reason
preventing the scheduling.

> > verification: no suitable queues
> >
> > $ qconf -sp pe_1
> > pe_name            pe_1
> > slots              9999999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    startmpi.sh $pe_hostfile
> > stop_proc_args     stopmpi.sh $pe_hostfile
> > allocation_rule    1
> > control_slaves     TRUE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > $ qconf -srqs limit_slots_for_users
> > {
> >    name         limit_slots_for_users
> >    description  "limit the number of simultaneous slots any user can use"
> >    enabled      TRUE
> >    limit        users {*} to slots=800
> > }
> >
> > And finally,
> > $ qstat
> > job-ID   prior    name   user  state  submit/start at      queue  slots  ja-task-ID
> > -----------------------------------------------------------------------------------
> > 7301584  0.60051  sleep  ilya  qw     04/20/2018 18:29:26         4
> > 7301747  0.50051  sleep  ilya  qw     04/20/2018 18:36:23         1
> >
> > So I am not running anything at the moment. If I submit a job with the
> > same PE to a production queue, it will get scheduled.
> >
> > A job that I left hanging last night finally got scheduled after 7-8
> > hours.
> >
> > The test queue is as follows:
> > $ qconf -sq test_gpu.q
> > qname       test_gpu.q
> > hostlist    @gpu

How many hosts are in @gpu? allocation_rule 1 means exactly one slot per
machine – not one slot granted repeatedly until the node is filled (this is
different from Torque, where it can be assigned several times per host).
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             NONE
> > pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
> > rerun                 TRUE
> > slots                 4
> > tmpdir                /data
> > shell                 /bin/sh
> > prolog                sgeg...@prolog.sh
> > epilog                sgeg...@epilog.sh
> > shell_start_mode      unix_behavior
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      custom_kill -p $job_pid -j $job_id

I don't know about your custom_kill procedure, but it should kill
-$job_pid, i.e. the process group and not only a single process.

-- Reuti

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
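[Editor's note: Reuti's point about killing the process group rather than a
single process can be shown with a small self-contained demo. This is a
sketch with assumed names, not the actual custom_kill script; it relies on
setsid(1) being available, as on Linux.]

```shell
#!/bin/sh
# Start a few sleepers in their own session, so the leader's PID is also
# the process group ID of the whole group.
setsid sh -c 'sleep 60 & sleep 60 & wait' &
leader=$!
sleep 1                                 # let the children start

# The leading minus signals the whole group, not just one process --
# this is what a terminate_method should do for parallel jobs.
kill -TERM -- "-$leader" 2>/dev/null
sleep 1

if kill -0 "$leader" 2>/dev/null; then
  result="still running"
else
  result="group terminated"
fi
echo "$result"
```

In a real terminate_method, Grid Engine substitutes $job_pid, so the
analogous call inside custom_kill would be kill -TERM -- "-$job_pid".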