Re: [gridengine users] PE offers 0 slots?

2017-08-17 Thread Michael Stauffer
On Thu, Aug 17, 2017 at 7:49 AM, Reuti  wrote:

>
> > > On 13.08.2017 at 18:11, Michael Stauffer wrote:
> >
> > Thanks for the reply Reuti, see below
> >
> > On Fri, Aug 11, 2017 at 7:18 PM, Reuti 
> wrote:
> >
> > What I notice below: defining h_vmem/s_vmem on a queue level means per
> job. Defining it on an exechost level means across all jobs. What is
> different between:
> >
> > ------------------------------------------------------------------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> > qf:h_vmem=40.000G
> > qf:s_vmem=40.000G
> > hc:slots=6
> > ------------------------------------------------------------------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> > hc:h_vmem=28.890G
> > hc:s_vmem=30.990G
> > hc:slots=6
> >
> >
> > qf = queue fixed
> > hc = host consumable
> >
> > What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
> consumptions?
> >
> > I thought this means that when it's showing qf, it's the per-job queue
> limit, i.e. the queue has a h_vmem and s_vmem limits for the job of 40G
> (which it does). And then hc is shown when the host resources are less than
> the per-job queue limit.
>
> Yes, the lower limit should be shown. So it's defined on both sides:
> exechost and queue?


Yes, the queue has a 40GB per-job limit, and h_vmem and s_vmem are
consumables on the exechosts.
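For reference, this is roughly how those per-host consumables get defined and
how I check them (just a sketch; the host name and values below are
placeholders rather than our exact settings):

qconf -mattr exechost complex_values h_vmem=40G,s_vmem=40G compute-0-13
qstat -F h_vmem,s_vmem,slots -q all.q@compute-0-13.local    # shows the qf:/hc: values per queue instance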

-M


Re: [gridengine users] PE offers 0 slots?

2017-08-17 Thread Reuti

> On 13.08.2017 at 18:11, Michael Stauffer wrote:
> 
> Thanks for the reply Reuti, see below
> 
> On Fri, Aug 11, 2017 at 7:18 PM, Reuti  wrote:
> 
> What I notice below: defining h_vmem/s_vmem on a queue level means per job. 
> Defining it on an exechost level means across all jobs. What is different 
> between:
> 
> > ------------------------------------------------------------------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> > qf:h_vmem=40.000G
> > qf:s_vmem=40.000G
> > hc:slots=6
> > ------------------------------------------------------------------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> > hc:h_vmem=28.890G
> > hc:s_vmem=30.990G
> > hc:slots=6
> 
> 
> qf = queue fixed
> hc = host consumable
> 
> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default 
> consumptions?
> 
> I thought this means that when it's showing qf, it's the per-job queue limit, 
> i.e. the queue has a h_vmem and s_vmem limits for the job of 40G (which it 
> does). And then hc is shown when the host resources are less than the per-job 
> queue limit.

Yes, the lower limit should be shown. So it's defined on both sides: exechost 
and queue?

-- Reuti


> [root@chead ~]# qconf -sc | grep vmem
> h_vmem             h_vmem    MEMORY    <=     YES          JOB         3100M    0
> s_vmem             s_vmem    MEMORY    <=     YES          JOB         3000M    0
> 
> > 'unihost' is the only PE I use. When users request multiple slots, they use 
> > 'unihost':
> >
> > qsub ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems like 
> > they should be, or they sit waiting in the queue for a long time even when 
> > the user has plenty of quota available within the queue they've requested, 
> > and there are enough resources available on the queue's nodes per 
> > qhost(slots and vmem are consumables), and qquota isn't showing any rqs 
> > limits have been reached.
> >
> > Below I've dumped relevant configurations.
> >
> > Today I created a new PE called "int_test" to test the "integer" allocation 
> > rule. I set it to 16 (16 cores per node), and have also tried 8. It's been 
> > added as a PE to the queues we use. When I try to run to this new PE 
> > however, it *always* fails with the same "PE ...offers 0 slots" error, even 
> > if I can run the same multi-slot job using "unihost" PE at the same time. 
> > I'm not sure if this helps debug or not.
> >
> > Another thought - this behavior started happening some time ago more or 
> > less when I tried implementing fairshare behavior. I never seemed to get 
> > fairshare working right. We haven't been able to confirm, but for some 
> > users it seems this "PE 0 slots" issue pops up only after they've been 
> > running other jobs for a little while. So I'm wondering if I've screwed up 
> > fairshare in some way that's causing this odd behavior.
> >
> > The default queue from global config file is all.q.
> 
> There is no default queue in SGE. One specifies resource requests and SGE 
> will select an appropriate one. What do you refer to by this?
> 
> Do you have any sge_request or private .sge_request?
> 
> Yes, the global sge_request has '-q all.q'. I can't remember why this was 
> done when I first set things up years ago - I think the cluster I was 
> migrating from was set up that way and I just copied it.
> 
> Given my qconf '-ssconf' and '-sconf' output below, does something look off 
> with my fairshare setup (and subsequent attempt to disable it)? As I 
> mentioned, I'm wondering if something went wrong with how I set it up because 
> this intermittent behavior may have started at the same time.
> 
> -M 
> 
> >
> > Here are various config dumps. Is there anything else that might be helpful?
> >
> > Thanks for any help! This has been plaguing me.
> >
> >
> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots              
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots              
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -ssconf
> > algorithm default
> > schedule_interval 0:0:5
> > maxujobs  200
> > 

Re: [gridengine users] PE offers 0 slots?

2017-08-15 Thread Michael Stauffer
I have a new insight which is very helpful. Thanks to Mark Bergman who
mentioned that the 'PE offers 0 slots' error/warning can also mean memory
limitations.

If the stuck-job problem is happening to a user, I can get jobs to run if I
make no memory request, or make a memory request (e.g., -l h_vmem=...)
that's less than the default value for the complex. If I request more than
about 100M above the default, the job gets stuck with the "PE offers 0
slots" warning. Interesting!

Any thoughts on this? Again, this is happening when there are plenty of
resources on the nodes and plenty of room in the users' quotas.

I'll test more tomorrow, but this may mean I can at least get a workaround
going by having a large default request and forcing users to make an
explicit memory request.
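
In case it's useful to anyone following along, these are the two knobs I mean,
as I read the qconf/qsub man pages (untested here; the job script name and
memory values are only placeholders):

# raise the "default" column for h_vmem/s_vmem, or set "requestable" to FORCED
# so every job must request memory explicitly -- edit those rows with:
qconf -mc
# jobs would then be submitted with an explicit request, e.g.:
qsub -l h_vmem=4G,s_vmem=4G -binding linear:2 -pe unihost 2 myjob.sh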

-M

On Tue, Aug 15, 2017 at 6:40 PM, Michael Stauffer 
wrote:

> ##
>> In regard of 'int_test' PE you created. If you set allocation rule to
>> integer, it would mean that the job _must_ request amount of slots equal or
>> multiple to this value.
>> In your case, PE is defined to use '8' as allocation rule, so your job
>> must request 8 or 16 or 24 ... slots. In case of you request 2, the job
>> will never start, as the scheduler can't allocate 2 slots with allocation
>> rule set to 8.
>>
>> From man sge_pe:
>> "If  the  number  of  tasks  specified with the "-pe" option (see
>> qsub(1)) does not  divide  without  remainder  by thisthe  job
>>  will not be scheduled. "
>>
>> So, the fact that the job in int_test never starts if it requests 2 cores
>> - is totally fine from the scheduler point of view.
>>
>
> OK, thanks very much, that explains it. I'll test accordingly.
>
>
>> ##
>> In regard of this issue in general: just wondering if you, or users on
>> the cluster use '-R y' ( reservation ) option for theirs jobs? I have seen
>> such a behavior, when someone submits a job with a reservation defined. The
>> scheduler reserves slots on the cluster for this big job, and doesn't let
>> new jobs come ( especially in case of runtime is not defined by h_rt ). In
>> this case, there will be no messages in the scheduler log which is
>> confusing some time.
>>
>
> I don't think users are using '-R y', but I'm not sure. Do you know how I
> can tell that? I think 'qstat -g c' shows that in the RES column? I don't
> think I've ever seen non-zero there, but I'll pay attention. However the
> stuck-job issue is happening right now to at least one user, and the RES
> column is all zeros.
>
> -M
>
>
>>
>> Best regards,
>> Mikhail Serkov
>>
>> On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> Below I've dumped relevant configurations.
>>>
>>> Today I created a new PE called "int_test" to test the "integer"
>>> allocation rule. I set it to 16 (16 cores per node), and have also tried 8.
>>> It's been added as a PE to the queues we use. When I try to run to this new
>>> PE however, it *always* fails with the same "PE ...offers 0 slots" error,
>>> even if I can run the same multi-slot job using "unihost" PE at the same
>>> time. I'm not sure if this helps debug or not.
>>>
>>> Another thought - this behavior started happening some time ago more or
>>> less when I tried implementing fairshare behavior. I never seemed to get
>>> fairshare working right. We haven't been able to confirm, but for some
>>> users it seems this "PE 0 slots" issue pops up only after they've been
>>> running other jobs for a little while. So I'm wondering if I've screwed up
>>> fairshare in some way that's causing this odd behavior.
>>>
>>>
>>>
>


Re: [gridengine users] PE offers 0 slots?

2017-08-15 Thread Michael Stauffer
>
> ##
> In regard of 'int_test' PE you created. If you set allocation rule to
> integer, it would mean that the job _must_ request amount of slots equal or
> multiple to this value.
> In your case, PE is defined to use '8' as allocation rule, so your job
> must request 8 or 16 or 24 ... slots. In case of you request 2, the job
> will never start, as the scheduler can't allocate 2 slots with allocation
> rule set to 8.
>
> From man sge_pe:
> "If  the  number  of  tasks  specified with the "-pe" option (see qsub(1))
> does not  divide  without  remainder  by thisthe  job  will not be
> scheduled. "
>
> So, the fact that the job in int_test never starts if it requests 2 cores
> - is totally fine from the scheduler point of view.
>

OK, thanks very much, that explains it. I'll test accordingly.
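
Just to make sure I test it the right way: with an allocation rule of 8 the
requests would need to look something like this (myjob.sh is only a
placeholder):

qsub -binding linear:8 -pe int_test 8 myjob.sh    # OK: equals the allocation rule
qsub -pe int_test 16 myjob.sh                     # OK: a multiple of 8 (2 hosts x 8 slots)
qsub -pe int_test 2 myjob.sh                      # never schedulable with a rule of 8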


> ##
> In regard of this issue in general: just wondering if you, or users on the
> cluster use '-R y' ( reservation ) option for theirs jobs? I have seen such
> a behavior, when someone submits a job with a reservation defined. The
> scheduler reserves slots on the cluster for this big job, and doesn't let
> new jobs come ( especially in case of runtime is not defined by h_rt ). In
> this case, there will be no messages in the scheduler log which is
> confusing some time.
>

I don't think users are using '-R y', but I'm not sure. Do you know how I
can tell that? I think 'qstat -g c' shows that in the RES column? I don't
think I've ever seen non-zero there, but I'll pay attention. However the
stuck-job issue is happening right now to at least one user, and the RES
column is all zeros.
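
For what it's worth, I'll also try inspecting a few pending jobs directly,
something like the check below (this assumes qstat -j reports the reservation
setting for jobs submitted with '-R y'; the job id is a placeholder):

qstat -j <jobid> | grep -i reserv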

-M


>
> Best regards,
> Mikhail Serkov
>
> On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer 
> wrote:
>
>> Hi,
>>
>>
>> Below I've dumped relevant configurations.
>>
>> Today I created a new PE called "int_test" to test the "integer"
>> allocation rule. I set it to 16 (16 cores per node), and have also tried 8.
>> It's been added as a PE to the queues we use. When I try to run to this new
>> PE however, it *always* fails with the same "PE ...offers 0 slots" error,
>> even if I can run the same multi-slot job using "unihost" PE at the same
>> time. I'm not sure if this helps debug or not.
>>
>> Another thought - this behavior started happening some time ago more or
>> less when I tried implementing fairshare behavior. I never seemed to get
>> fairshare working right. We haven't been able to confirm, but for some
>> users it seems this "PE 0 slots" issue pops up only after they've been
>> running other jobs for a little while. So I'm wondering if I've screwed up
>> fairshare in some way that's causing this odd behavior.
>>
>>
>>


Re: [gridengine users] PE offers 0 slots?

2017-08-14 Thread Mike Serkov
Hello Michael,

##
Regarding the 'int_test' PE you created: if you set the allocation rule to
an integer, the job _must_ request a number of slots equal to, or a
multiple of, that value.
In your case the PE is defined with '8' as the allocation rule, so a job
must request 8, 16, 24, ... slots. If it requests 2, it will never start,
because the scheduler can't allocate 2 slots with an allocation rule of 8.

From man sge_pe:
"If the number of tasks specified with the "-pe" option (see qsub(1))
does not divide without remainder by this number, the job will not be
scheduled."

So the fact that a job in int_test never starts when it requests 2 cores
is entirely expected from the scheduler's point of view.

##
Regarding the issue in general: I'm wondering whether you, or users on the
cluster, use the '-R y' (reservation) option for their jobs? I have seen
this kind of behavior when someone submits a job with a reservation. The
scheduler reserves slots on the cluster for that big job and doesn't let
new jobs in (especially when no runtime is defined via h_rt). In that
case there are no messages in the scheduler log, which can be confusing.

Best regards,
Mikhail Serkov

On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer 
wrote:

> Hi,
>
>
> Below I've dumped relevant configurations.
>
> Today I created a new PE called "int_test" to test the "integer"
> allocation rule. I set it to 16 (16 cores per node), and have also tried 8.
> It's been added as a PE to the queues we use. When I try to run to this new
> PE however, it *always* fails with the same "PE ...offers 0 slots" error,
> even if I can run the same multi-slot job using "unihost" PE at the same
> time. I'm not sure if this helps debug or not.
>
> Another thought - this behavior started happening some time ago more or
> less when I tried implementing fairshare behavior. I never seemed to get
> fairshare working right. We haven't been able to confirm, but for some
> users it seems this "PE 0 slots" issue pops up only after they've been
> running other jobs for a little while. So I'm wondering if I've screwed up
> fairshare in some way that's causing this odd behavior.
>
>
>


Re: [gridengine users] PE offers 0 slots?

2017-08-14 Thread Michael Stauffer
I have some more information.

We have two sets of exec hosts on the cluster, one in the host
group/hostlist "@allhosts" that is assigned to the queue all.q. The other
is in the group "@basichosts", which is assigned to a queue called basic.q.

When we're having the trouble with multi-slot/core jobs not running for a
user on all.q, the same jobs can be resubmitted (or added via qalter) to
basic.q, and they will run immediately.

I made a duplicate queue of all.q, called allalt.q. The same problem
happens with jobs getting stuck in the queue. When I change the hostlist in
allalt.q, and nothing else, from @allhosts to @basichosts, the stuck jobs
run immediately. (Again, this is happening when there are plenty of
resources reported available on all.q hosts, and the user's quotas are
either empty or not maxed.)
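
Next time a job gets stuck I'll also ask the scheduler directly why it can't
be placed, using the validation mode from the qsub/qalter man pages (the job
id is a placeholder; as I understand it, -w p only reports, it doesn't
resubmit or reschedule anything):

qalter -w p <jobid>     # dry-run check of the pending job against the current cluster state
qstat -j <jobid>        # scheduling info appears at the bottom while schedd_job_info is true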

Here are the definitions of a host from each of the groups:

A host from all.q's group, @allhosts, where jobs get stuck:
[root@chead ~]# qconf -se compute-0-1

hostname  compute-0-1.local
load_scaling  NONE
complex_values    h_vmem=125.49G,s_vmem=125.49G,slots=16.00
load_values   arch=lx-amd64,num_proc=16,mem_total=64508.523438M, \
  swap_total=31999.996094M,virtual_total=96508.519531M, \
  m_topology=SS,m_socket=2,m_core=16, \
  m_thread=16,load_avg=7.59,load_short=7.66, \
  load_medium=7.59,load_long=7.30, \
  mem_free=53815.035156M,swap_free=31834.675781M, \
  virtual_free=85649.710938M,mem_used=10693.488281M, \
  swap_used=165.320312M,virtual_used=10858.808594M, \
  cpu=42.80,m_topology_inuse=SSccCccCCC, \
  np_load_avg=0.474375,np_load_short=0.478750, \
  np_load_medium=0.474375,np_load_long=0.456250
processors        16
user_lists        NONE
xuser_lists   NONE
projects  NONE
xprojects NONE
usage_scaling NONE
report_variables  NONE


And a host from basic.q's group, @basichosts, where jobs run immediately:
[root@chead ~]# qconf -se compute-1-0

hostname  compute-1-0.local
load_scaling  NONE
complex_values    h_vmem=19.02G,s_vmem=19.02G,slots=8.00
load_values   arch=lx-amd64,num_proc=8,mem_total=16077.441406M, \
  swap_total=3999.996094M,virtual_total=20077.437500M, \
  m_topology=SS,m_socket=2,m_core=8,m_thread=8, \
  load_avg=1.68,load_short=2.42, \
  load_medium=1.68,load_long=1.79, \
  mem_free=13408.687500M,swap_free=3973.464844M, \
  virtual_free=17382.152344M,mem_used=2668.753906M, \
  swap_used=26.531250M,virtual_used=2695.285156M, \
  cpu=16.40,m_topology_inuse=SccCCScCCC, \
  np_load_avg=0.21,np_load_short=0.302500, \
  np_load_medium=0.21,np_load_long=0.223750
processors        8
user_lists        NONE
xuser_lists   NONE
projects  NONE
xprojects NONE
usage_scaling NONE
report_variables  NONE


Here's the full complex config.
'slots' is listed as "YES" under consumable, whereas s_vmem and h_vmem are
listed as "JOB". That seems like it should be OK, but maybe not? Also,
'slots' has urgency 1000, whereas the others have 0.

[root@chead ~]# qconf -sc

#name              shortcut  type      relop  requestable  consumable  default  urgency
#------------------------------------------------------------------------------------------
arch               a         RESTRING  ==     YES          NO          NONE     0
calendar           c         RESTRING  ==     YES          NO          NONE     0
cpu                cpu       DOUBLE    >=     YES          NO          0        0
display_win_gui    dwg       BOOL      ==     YES          NO          0        0
h_core             h_core    MEMORY    <=     YES          NO          0        0
h_cpu              h_cpu     TIME      <=     YES          NO          0:0:0    0
h_data             h_data    MEMORY    <=     YES          NO          0        0
h_fsize            h_fsize   MEMORY    <=     YES          NO          0        0
h_rss              h_rss     MEMORY    <=     YES          NO          0        0
h_rt               h_rt      TIME      <=     YES          NO          0:0:0    0
h_stack            h_stack   MEMORY    <=     YES          NO          0        0
h_vmem             h_vmem    MEMORY    <=     YES          JOB         3100M    0
hostname           h         HOST      ==     YES          NO          NONE     0
load_avg           la        DOUBLE    >=     NO           NO          0        0
load_long          ll        DOUBLE    >=     NO           NO          0        0
load_medium        lm        DOUBLE    >=     NO           NO          0
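One thing worth spelling out about the consumable types above, at least as I
understand them from the man pages (so treat this as a sketch, not gospel): a
JOB consumable is debited once per job, while a YES consumable is debited per
slot. With the defaults shown:

# h_vmem is a JOB consumable with default 3100M:
#   qsub -pe unihost 4 -l h_vmem=10G ...   -> debits 10G from the host, not 4 x 10G
#   qsub -pe unihost 4 ...                 -> no request, so the 3100M default is debited once
# slots is a per-slot (YES) consumable:
#   qsub -pe unihost 4 ...                 -> debits 4 slots on the chosen host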

Re: [gridengine users] PE offers 0 slots?

2017-08-13 Thread Michael Stauffer
Thanks for the reply Reuti, see below

On Fri, Aug 11, 2017 at 7:18 PM, Reuti  wrote:

>
> What I notice below: defining h_vmem/s_vmem on a queue level means per
> job. Defining it on an exechost level means across all jobs. What is
> different between:
>
> > ------------------------------------------------------------------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> > qf:h_vmem=40.000G
> > qf:s_vmem=40.000G
> > hc:slots=6
> > ------------------------------------------------------------------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> > hc:h_vmem=28.890G
> > hc:s_vmem=30.990G
> > hc:slots=6
>
>
> qf = queue fixed
> hc = host consumable
>
> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
> consumptions?
>

I thought this means that when qf is shown, it's the per-job queue
limit, i.e. the queue has h_vmem and s_vmem limits of 40G per job
(which it does), and hc is shown when the host resources are less than
the per-job queue limit.

[root@chead ~]# qconf -sc | grep vmem
h_vmem             h_vmem    MEMORY    <=     YES          JOB         3100M    0
s_vmem             s_vmem    MEMORY    <=     YES          JOB         3000M    0

> 'unihost' is the only PE I use. When users request multiple slots, they
> use 'unihost':
> >
> > qsub ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems
> like they should be, or they sit waiting in the queue for a long time even
> when the user has plenty of quota available within the queue they've
> requested, and there are enough resources available on the queue's nodes
> per qhost(slots and vmem are consumables), and qquota isn't showing any rqs
> limits have been reached.
> >
> > Below I've dumped relevant configurations.
> >
> > Today I created a new PE called "int_test" to test the "integer"
> allocation rule. I set it to 16 (16 cores per node), and have also tried 8.
> It's been added as a PE to the queues we use. When I try to run to this new
> PE however, it *always* fails with the same "PE ...offers 0 slots" error,
> even if I can run the same multi-slot job using "unihost" PE at the same
> time. I'm not sure if this helps debug or not.
> >
> > Another thought - this behavior started happening some time ago more or
> less when I tried implementing fairshare behavior. I never seemed to get
> fairshare working right. We haven't been able to confirm, but for some
> users it seems this "PE 0 slots" issue pops up only after they've been
> running other jobs for a little while. So I'm wondering if I've screwed up
> fairshare in some way that's causing this odd behavior.
> >
> > The default queue from global config file is all.q.
>
> There is no default queue in SGE. One specifies resource requests and SGE
> will select an appropriate one. What do you refer to by this?
>
> Do you have any sge_request or private .sge_request?
>

Yes, the global sge_request has '-q all.q'. I can't remember why this was
done when I first set things up years ago - I think the cluster I was
migrating from was set up that way and I just copied it.
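
(For anyone checking their own setup: that file lives in the cell's common
directory, and there can be per-user overrides too. The path below assumes
the default cell name:)

cat $SGE_ROOT/default/common/sge_request    # global defaults, e.g. '-q all.q'
cat ~/.sge_request                          # per-user defaults, if present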

Given my qconf '-ssconf' and '-sconf' output below, does something look off
with my fairshare setup (and subsequent attempt to disable it)? As I
mentioned, I'm wondering if something went wrong with how I set it up
because this intermittent behavior may have started at the same time.

-M

>
> > Here are various config dumps. Is there anything else that might be
> helpful?
> >
> > Thanks for any help! This has been plaguing me.
> >
> >
> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots              
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots              
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -ssconf
> > algorithm default
> > schedule_interval 0:0:5
> > maxujobs  200
> > queue_sort_method load
> > job_load_adjustments  np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula  np_load_avg
> > schedd_job_info   true
> > flush_submit_sec  0
> > flush_finish_sec  0
> 

Re: [gridengine users] PE offers 0 slots?

2017-08-11 Thread Reuti
Hi,

On 12.08.2017 at 00:41, Michael Stauffer wrote:

> Hi,
> 
> I'm getting back to this post finally. I've looked at the links and 
> suggestions in the two replies to my original post a few months ago, but they 
> haven't helped. Here's my original:
> 
> I'm getting some queued jobs with scheduling info that includes this line at 
> the end:
> 
> cannot run in PE "unihost" because it only offers 0 slots

What I notice below: defining h_vmem/s_vmem on a queue level means per job. 
Defining it on an exechost level means across all jobs. What is different 
between:

> ------------------------------------------------------------------------------
> all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> qf:h_vmem=40.000G
> qf:s_vmem=40.000G
> hc:slots=6
> ------------------------------------------------------------------------------
> all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> hc:h_vmem=28.890G
> hc:s_vmem=30.990G
> hc:slots=6


qf = queue fixed
hc = host consumable

What is the definition of h_vmem/s_vmem in `qconf -sc` and their default 
consumptions?


> 'unihost' is the only PE I use. When users request multiple slots, they use 
> 'unihost':
> 
> qsub ... -binding linear:2 -pe unihost 2 ...
> 
> What happens is that these jobs aren't running when it otherwise seems like 
> they should be, or they sit waiting in the queue for a long time even when 
> the user has plenty of quota available within the queue they've requested, 
> and there are enough resources available on the queue's nodes per qhost(slots 
> and vmem are consumables), and qquota isn't showing any rqs limits have been 
> reached.
> 
> Below I've dumped relevant configurations.
> 
> Today I created a new PE called "int_test" to test the "integer" allocation 
> rule. I set it to 16 (16 cores per node), and have also tried 8. It's been 
> added as a PE to the queues we use. When I try to run to this new PE however, 
> it *always* fails with the same "PE ...offers 0 slots" error, even if I can 
> run the same multi-slot job using "unihost" PE at the same time. I'm not sure 
> if this helps debug or not.
> 
> Another thought - this behavior started happening some time ago more or less 
> when I tried implementing fairshare behavior. I never seemed to get fairshare 
> working right. We haven't been able to confirm, but for some users it seems 
> this "PE 0 slots" issue pops up only after they've been running other jobs 
> for a little while. So I'm wondering if I've screwed up fairshare in some way 
> that's causing this odd behavior.
> 
> The default queue from global config file is all.q.

There is no default queue in SGE. One specifies resource requests and SGE will 
select an appropriate one. What do you refer to by this?

Do you have any sge_request or private .sge_request?

-- Reuti


> 
> Here are various config dumps. Is there anything else that might be helpful?
> 
> Thanks for any help! This has been plaguing me.
> 
> 
> [root@chead ~]# qconf -sp unihost
> pe_name            unihost
> slots              
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
> 
> [root@chead ~]# qconf -sp int_test
> pe_name            int_test
> slots              
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    8
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
> 
> [root@chead ~]# qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:5
> maxujobs                          200
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          1
> usage_weight_list                 cpu=0.70,mem=0.20,io=0.10
> compensation_factor               5.00
> weight_user                       0.25
> weight_project                    0.25
> weight_department                 0.25
> weight_job                        0.25
> weight_tickets_functional         1000
> weight_tickets_share              10
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   2000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         100
> halflife_decay_list               none
> 

Re: [gridengine users] PE offers 0 slots?

2017-08-11 Thread Michael Stauffer
Hi,

I'm getting back to this post finally. I've looked at the links and
suggestions in the two replies to my original post a few months ago, but
they haven't helped. Here's my original:

I'm getting some queued jobs with scheduling info that includes this line
at the end:

cannot run in PE "unihost" because it only offers 0 slots

'unihost' is the only PE I use. When users request multiple slots, they use
'unihost':

qsub ... -binding linear:2 -pe unihost 2 ...

What happens is that these jobs aren't running when it otherwise seems like
they should be, or they sit waiting in the queue for a long time. This
happens even when the user has plenty of quota available within the queue
they've requested, there are enough resources available on the queue's
nodes per qhost (slots and vmem are consumables), and qquota isn't showing
that any RQS limits have been reached.
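
(Concretely, by "qquota isn't showing that any RQS limits have been reached"
I mean checks along these lines, with the user name as a placeholder:)

qquota -u <username>    # resource quota usage for that user against each rqs rule
qstat -g c              # cluster queue summary: used vs. available slots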

Below I've dumped relevant configurations.

Today I created a new PE called "int_test" to test the "integer" allocation
rule. I set it to 16 (16 cores per node), and have also tried 8. It's been
added as a PE to the queues we use. When I try to run on this new PE,
however, it *always* fails with the same "PE ... offers 0 slots" error, even
if I can run the same multi-slot job using the "unihost" PE at the same time.
I'm not sure if this helps debug or not.

Another thought - this behavior started happening more or less around the
time I tried implementing fairshare. I never seemed to get fairshare
working right. We haven't been able to confirm, but for some users it seems
this "PE 0 slots" issue pops up only after they've been running other jobs
for a little while. So I'm wondering if I've screwed up fairshare in some
way that's causing this odd behavior.

The default queue from the global config file is all.q.

Here are various config dumps. Is there anything else that might be helpful?

Thanks for any help! This has been plaguing me.


[root@chead ~]# qconf -sp unihost

pe_name            unihost
slots              
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE


[root@chead ~]# qconf -sp int_test

pe_name            int_test
slots              
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE


[root@chead ~]# qconf -ssconf

algorithm                         default
schedule_interval                 0:0:5
maxujobs                          200
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          1
usage_weight_list                 cpu=0.70,mem=0.20,io=0.10
compensation_factor               5.00
weight_user                       0.25
weight_project                    0.25
weight_department                 0.25
weight_job                        0.25
weight_tickets_functional         1000
weight_tickets_share              10
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   2000
report_pjob_tickets               TRUE
max_pending_tasks_per_job         100
halflife_decay_list               none
policy_hierarchy                  OS
weight_ticket                     0.00
weight_waiting_time               1.00
weight_deadline                   360.00
weight_urgency                    0.10
weight_priority                   1.00
max_reservation                   0
default_duration                  INFINITY


[root@chead ~]# qconf -sconf

#global:
execd_spool_dir              /opt/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,bash,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           02:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none

Re: [gridengine users] PE offers 0 slots?

2017-02-13 Thread Jesse Becker

On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:

SoGE 8.1.8

Hi,

I'm getting some queued jobs with scheduling info that includes this line
at the end:

cannot run in PE "unihost" because it only offers 0 slots

'unihost' is the only PE I use. When users request multiple slots, they use
'unihost':

... -binding linear:2 -pe unihost 2 ...

What happens is that these jobs aren't running when it otherwise seems like
they should be, or they sit waiting in the queue for a long time even when
the user has plenty of quota available within the queue they've requested,
and there are enough resources available on the queue's nodes (slots and
vmem are consumables).

Any suggestions about how I might further understand this?


This *exact* problem has bitten me in the past.  It seems to crop up
about every 3 years--long enough to remember it was a problem, and long
enough to forget just what the [censored] I did to fix it.

As I recall, it has little to do with actual PEs, but everything to do
with complexes and resource requests.

You might glean a bit more information by running "qsub -w p" 
(or "-w e").


Take a look at these previous discussions:

http://gridengine.org/pipermail/users/2011-November/001932.html
http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700


--
Jesse Becker (Contractor)


Re: [gridengine users] PE offers 0 slots?

2017-02-13 Thread Michael Stauffer
On Mon, Feb 13, 2017 at 2:32 PM, Luis Huang  wrote:

> Check to make sure you haven’t got any rqs interfering.
>

I don't see any rqs as interfering. qquota for the users in question
returns that they have quota available on queues to which their jobs are
submitted. And qstat on the queue shows available resources.


> I just had the exact same problem and it turns out that RQS was limiting it.
>
>
>
> Also check your qconf -spl to make sure your PE has got enough slots.
>

The PE is assigned  slots, and the cluster has 500 total.

Thanks for the reply.

-M


Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-29 Thread Reuti
On 24.01.2013 at 18:54, Dave Love wrote:

 [Excuse any duplicates -- I'm not sure if gridengine.org is tits-up
 again as well as our mail hub sulking at my laptop.]
 
 Reuti re...@staff.uni-marburg.de writes:
 
 I think that's an old version.  Suggestions are welcome for any
 improvements to the current one, which I tried to tidy up (from which
 http://arc.liv.ac.uk/SGE/htmlman/htmlman5/queue_conf.html is derived).
 
 Aha, I see. It's now at the beginning of the man page.
 
 But shouldn't the outer brackets being bold instead of the inner ones? The 
 outer ones are the meta-symbols.
 
 I don't think so.  Bold is meant to be literal, as in the SYNOPSIS.  I
 thought that was the closest to a proper convention; is that wrong or
 confusing?  (I hope not after trying to get the markup straight in a lot
 of places!)

Okay, I see.

-- Reuti


 I've been tempted to use mdoc, but it's substantial work to convert, and
 probably means re-introducing catman.
 
 -- 
 Community Grid Engine:  http://arc.liv.ac.uk/SGE/




Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-24 Thread Dave Love
[Excuse any duplicates -- I'm not sure if gridengine.org is tits-up
again as well as our mail hub sulking at my laptop.]

Reuti re...@staff.uni-marburg.de writes:

 I think that's an old version.  Suggestions are welcome for any
 improvements to the current one, which I tried to tidy up (from which
 http://arc.liv.ac.uk/SGE/htmlman/htmlman5/queue_conf.html is derived).

 Aha, I see. It's now at the beginning of the man page.

 But shouldn't the outer brackets being bold instead of the inner ones? The 
 outer ones are the meta-symbols.

I don't think so.  Bold is meant to be literal, as in the SYNOPSIS.  I
thought that was the closest to a proper convention; is that wrong or
confusing?  (I hope not after trying to get the markup straight in a lot
of places!)

I've been tempted to use mdoc, but it's substantial work to convert, and
probably means re-introducing catman.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/


Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-18 Thread Reuti
On 18.01.2013 at 17:24, Dave Love wrote:

 Reuti re...@staff.uni-marburg.de writes:
 
 It's not limited to a PE list entry, but applies to all. It is
 explained at the beginning of `man queue_conf` under
 hostlist. Although it's hard to read due to the bracket being a meta
 symbol and a character to be typed. 
 
 I think that's an old version.  Suggestions are welcome for any
 improvements to the current one, which I tried to tidy up (from which
 http://arc.liv.ac.uk/SGE/htmlman/htmlman5/queue_conf.html is derived).

Aha, I see. It's now at the beginning of the man page.

But shouldn't the outer brackets be bold instead of the inner ones? The
outer ones are the meta-symbols.

-- Reuti


Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-12 Thread Reuti
On 12.01.2013 at 01:04, berg...@merctech.com wrote:

 Where is the syntax for the pe_list parameter documented? I looked for an
 explanation, but didn't find details or examples in the man pages. There were
 some previous discussions on the mailing list (mostly from you), but they
 don't provide a general syntax, just specific answers.

It's not limited to the PE list entry, but applies to all of them. It is
explained at the beginning of `man queue_conf` under hostlist, although it's
hard to read because the bracket is both a meta symbol and a character to be
typed. The parameters_specifier_syntax there is the same one you would use
for the default list, and the complete list of parameters needs to be entered
for specific hosts/groups, since it overrides the default list.
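
In other words, the same shape as the working pe_list further down in this
thread: the unbracketed part is the default for all hosts, and each
[@hostgroup=...] entry replaces the complete list for that hostgroup, e.g.:

pe_list   make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel threaded]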

-- Reuti


Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-11 Thread Reuti
On 11.01.2013 at 23:16, berg...@merctech.com wrote:

 
 I recently reconfigured our SGE (6.2u5) environment to better handle MPI jobs
 on a heterogeneous cluster. This seems to have caused a problem with the
 threaded (SMP) PE.
 
 Our PEs are:
 
   qconf -spl
   make(unused)
   openmpi-AMD
   openmpi-Intel
   threaded
 
 
 I'm using a JSV to allow users to request -pe openmpi and alter that
 to -pe openmpi-*. The two openmpi-* PEs are both assigned to the
 all.q, but only given a hostgroup with the appropriate servers. This
 works fine for OpenMPI jobs.
 
 The PE threaded is also assigned to the all.q. That PE should consist of
 all hosts in the queue.
 
   qconf -sq all.q | grep pe_list
   pe_list  threaded 
 make,[@mpi-AMD=openmpi-AMD],[@mpi-Intel=openmpi-Intel]

pe_list  make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel threaded]

should do it - Reuti
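
(One way to apply it, either interactively or in one shot; the quoting below
is only illustrative:)

qconf -mq all.q     # then change the pe_list line to the value above
qconf -mattr queue pe_list "make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel threaded]" all.q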


 However, jobs submitted with a request for -pe threaded are not run. SGE
 claims that the PE is not assigned to any queue:
 
   qstat -j 5170487
   parallel environment:  threaded range: 4
   cannot run in queue all.q@c5-10 because PE threaded is not 
 in pe list
cannot run in queue all.q@c5-11 because PE threaded is not 
 in pe list
cannot run in queue all.q@c5-12 because PE threaded is not 
 in pe list
 
 
 I've tried assiging a hostgroup (@batch, the same as the hostgroup
 assigned to the all.q) to the threaded PE, but that puts the nodes
 into the c(onfiguration ambiguous) state.
 
 Any suggestions?
 
 Thanks,
 
 Mark




Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-11 Thread bergman


In the message dated: Fri, 11 Jan 2013 23:45:05 +0100,
The pithy ruminations from Reuti on 
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
 were:
= On 11.01.2013 at 23:16, berg...@merctech.com wrote:
= 
=  
[SNIP!]

=  
= qconf -sq all.q | grep pe_list
= pe_list  threaded 
make,[@mpi-AMD=openmpi-AMD],[@mpi-Intel=openmpi-Intel]
= 
= pe_list  make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel 
threaded]
= 
= should do it - Reuti
= 

Yes, that fixed the problem.

Thank you very much for the prompt and accurate answer.

Where is the syntax for the pe_list parameter documented? I looked for an
explanation, but didn't find details or examples in the man pages. There were
some previous discussions on the mailing list (mostly from you), but they
don't provide a general syntax, just specific answers.

Thanks again,

Mark