Hi,

On 20.04.2018 at 22:55, Ilya M wrote:

> Hi Reuti, 
> 
> There are dozens of hosts in @gpu. In my test submissions, however, I am 
> using only one host that I specify with '-l hostname='. I disabled all other 
> queues on this host to make sure nothing else but my test jobs are running 
> there.
> 
> BTW, after several hours, my PE 1 job went through. My submissions to the 
> regular queue worked fine.
> 
> 
> Update: As I was writing this response, I tried one change in the queue 
> configuration: I created a new host group with only one node in it and 
> changed my test queue to only run on that hostgroup. I submitted a couple of 
> PE jobs with allocation rules '1', '2', '4', and did not request a specific 
> hostname this time. The jobs started running immediately. And the old jobs 
> that had been waiting, also went through.
> 
> After discovering that, I tested normal production queue, combining '-l 
> hostname=' and '-pe'. These jobs did not run and 'qalter -w v' reported 
> "cannot run because it exceeds limit "ilya/////" in rule 
> "limit_slots_for_users/1"
> 
> So in my cluster, there seems to be some issue with RQS, PE and '-l 
> hostname=' combination that makes jobs unschedulable. I wonder if anyone else 
> can reproduce this behavior to see if this is an SGE bug or some problem in 
> my configuration.

Yes, there are two issues here:

- there is a problem with load values and RQS used at the same time
- there is a problem with the combination of "-pe" and "-l hostname=", but this 
can be circumvented by requesting "-q *@myhost" instead of "-l hostname=" (at 
least when I used it last time)
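
For illustration, a sketch of the failing and the working submission forms 
(the host name gpu001 and job script sleep.sh are placeholders; pe_1 is taken 
from the quoted queue configuration):

```shell
# Placeholder host (gpu001) and job script (sleep.sh).
# This form can stay in "qw" state indefinitely when an RQS is active:
qsub -pe pe_1 1 -l hostname=gpu001 sleep.sh

# Workaround: pin the host via a wildcard queue-instance request instead:
qsub -pe pe_1 1 -q "*@gpu001" sleep.sh
```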

-- Reuti


> 
> Ilya.
> 
> 
> On Fri, Apr 20, 2018 at 12:34 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
> 
> On 20.04.2018 at 21:04, Ilya M wrote:
> 
> > Hello,
> > 
> > I set up a test queue to test new prolog/epilog scripts and I am seeing 
> > some strange behavior when I submit a PE job to this queue, which causes 
> > the job to not get scheduled for a very long time, if ever. I 
> > tried several PEs with allocation rules of '1', '2', '4', all to no avail. 
> > Submitting a job without a PE makes it run immediately. I am using SGE 
> > 6.2u5.
> > 
> > Checking why it is not running:
> > $ qalter -w v 7301747
> > ...
> > Job 7301747 cannot run because it exceeds limit "ilya/////" in rule 
> > "limit_slots_for_users/1"
> > Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots
> 
> This error message is often misleading, although there is a real reason 
> preventing the scheduling.
> 
> > verification: no suitable queues
> > 
> > $ qconf -sp pe_1
> > pe_name            pe_1
> > slots              9999999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    startmpi.sh $pe_hostfile
> > stop_proc_args     stopmpi.sh $pe_hostfile
> > allocation_rule    1
> > control_slaves     TRUE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > 
> > $ qconf -srqs limit_slots_for_users
> > {
> >    name         limit_slots_for_users
> >    description  "limit the number of simultaneous slots any user can use"
> >    enabled      TRUE
> >    limit        users {*} to slots=800
> > }
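
As a side note on diagnosing rejections against a rule like the one quoted 
above: qquota shows what Grid Engine currently accounts against each resource 
quota rule (user name and job ID taken from the quoted output):

```shell
# Show the current consumption of all RQS rules for the user:
qquota -u ilya

# Cross-check why a specific pending job is not scheduled:
qalter -w v 7301747
```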
> > 
> > And finally, 
> > $ qstat
> > job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID 
> > ---------------------------------------------------------------------------------------------
> > 7301584 0.60051 sleep      ilya         qw    04/20/2018 18:29:26            4 
> > 7301747 0.50051 sleep      ilya         qw    04/20/2018 18:36:23            1 
> > 
> > So I am not running anything at the moment. If I submit a job with the same 
> > PE to a production queue, it will get scheduled.
> > 
> > A job that I left hanging last night finally got scheduled after 7-8 hours.
> > 
> > The test queue is as follows:
> > qconf -sq test_gpu.q
> > qname                 test_gpu.q
> > hostlist              @gpu
> 
> How many hosts are in @gpu? The allocation_rule 1 means exactly one slot per 
> machine, not one slot granted repeatedly until the node is filled (this is 
> different from Torque, where this can be assigned several times per host).
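
To illustrate: under allocation_rule 1 a 4-slot job must be spread over four 
distinct hosts, so the generated $pe_hostfile looks roughly like this (host 
names are made up; the queue name is taken from the quoted configuration):

```
gpu001 1 test_gpu.q@gpu001 UNDEFINED
gpu002 1 test_gpu.q@gpu002 UNDEFINED
gpu003 1 test_gpu.q@gpu003 UNDEFINED
gpu004 1 test_gpu.q@gpu004 UNDEFINED
```

Pinning such a multi-slot job to a single host with -l hostname= can then 
never be satisfied.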
> 
> 
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             NONE
> > pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
> > rerun                 TRUE
> > slots                 4
> > tmpdir                /data
> > shell                 /bin/sh
> > prolog                sgeg...@prolog.sh
> > epilog                sgeg...@epilog.sh
> > shell_start_mode      unix_behavior
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      custom_kill -p $job_pid -j $job_id
> 
> I don't know about your custom_kill procedure, but it should kill -$job_pid, 
> i.e. the process group and not only a single process.
> 
> - Reuti
> 
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
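
Regarding the terminate_method remark quoted above, a hypothetical sketch of 
such a custom_kill script (the option parsing matches the 'custom_kill -p 
$job_pid -j $job_id' invocation from the queue configuration; SGE expands the 
variables before calling the script):

```shell
#!/bin/sh
# Hypothetical custom_kill sketch: invoked as "custom_kill -p <pid> -j <jobid>".
while getopts p:j: opt; do
    case $opt in
        p) job_pid=$OPTARG ;;
        j) job_id=$OPTARG ;;
    esac
done
# A negative PID tells kill(1) to signal the entire process group,
# not just the single master process:
kill -TERM -- "-$job_pid"
```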

