Re: [gridengine users] PE offers 0 slots?
On Thu, Aug 17, 2017 at 7:49 AM, Reuti wrote:

> Am 13.08.2017 um 18:11 schrieb Michael Stauffer:
> > Thanks for the reply Reuti, see below
> > On Fri, Aug 11, 2017 at 7:18 PM, Reuti wrote:
> > > What I notice below: defining h_vmem/s_vmem on a queue level means per job. Defining it on an exechost level means across all jobs. What is different between:
> > >
> > > all.q@compute-0-13.local   BP   0/10/16   9.14   lx-amd64
> > >    qf:h_vmem=40.000G
> > >    qf:s_vmem=40.000G
> > >    hc:slots=6
> > >
> > > all.q@compute-0-14.local   BP   0/10/16   9.66   lx-amd64
> > >    hc:h_vmem=28.890G
> > >    hc:s_vmem=30.990G
> > >    hc:slots=6
> > >
> > > qf = queue fixed
> > > hc = host consumable
> > >
> > > What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?
> >
> > I thought this means that when it's showing qf, it's the per-job queue limit, i.e. the queue has h_vmem and s_vmem limits for the job of 40G (which it does). And then hc is shown when the host resources are less than the per-job queue limit.
>
> Yes, the lower limit should be shown. So it's defined on both sides: exechost and queue?

Yes, the queue has a 40GB per-job limit, and h_vmem and s_vmem are consumables on the exechosts.

-M
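To see where each limit comes from, it can help to query both levels directly. A rough sketch (the host name is taken from the examples above; output will differ on a real cluster):

  # per-job limit attached to the queue definition
  qconf -sq all.q | egrep 'h_vmem|s_vmem'

  # per-host consumable capacity set in complex_values
  qconf -se compute-0-13 | grep complex_values

  # what the scheduler currently sees as available, per queue instance
  qstat -F h_vmem,s_vmem -q all.q

The "qf:" lines in `qstat -F` come from the first, and the "hc:" lines from the second minus whatever running jobs have already been debited.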
Re: [gridengine users] PE offers 0 slots?
> Am 13.08.2017 um 18:11 schrieb Michael Stauffer:
> Thanks for the reply Reuti, see below
> On Fri, Aug 11, 2017 at 7:18 PM, Reuti wrote:
> > What I notice below: defining h_vmem/s_vmem on a queue level means per job. Defining it on an exechost level means across all jobs. What is different between:
> >
> > all.q@compute-0-13.local   BP   0/10/16   9.14   lx-amd64
> >    qf:h_vmem=40.000G
> >    qf:s_vmem=40.000G
> >    hc:slots=6
> >
> > all.q@compute-0-14.local   BP   0/10/16   9.66   lx-amd64
> >    hc:h_vmem=28.890G
> >    hc:s_vmem=30.990G
> >    hc:slots=6
> >
> > qf = queue fixed
> > hc = host consumable
> >
> > What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?
>
> I thought this means that when it's showing qf, it's the per-job queue limit, i.e. the queue has h_vmem and s_vmem limits for the job of 40G (which it does). And then hc is shown when the host resources are less than the per-job queue limit.

Yes, the lower limit should be shown. So it's defined on both sides: exechost and queue?

-- Reuti

> [root@chead ~]# qconf -sc | grep vmem
> h_vmem    h_vmem    MEMORY    <=    YES    JOB    3100M    0
> s_vmem    s_vmem    MEMORY    <=    YES    JOB    3000M    0
>
> > 'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':
> >
> > qsub ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, and there are enough resources available on the queue's nodes per qhost (slots and vmem are consumables), and qquota isn't showing any rqs limits have been reached.
> >
> > Below I've dumped relevant configurations.
> >
> > Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.
> >
> > Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.
> >
> > The default queue from global config file is all.q.
>
> There is no default queue in SGE. One specifies resource requests and SGE will select an appropriate one. What do you refer to by this?
>
> Do you have any sge_request or private .sge_request?
>
> Yes, the global sge_request has '-q all.q'. I can't remember why this was done when I first set things up years ago - I think the cluster I was migrating from was set up that way and I just copied it.
>
> Given my qconf '-ssconf' and '-sconf' output below, does something look off with my fairshare setup (and subsequent attempt to disable it)? As I mentioned, I'm wondering if something went wrong with how I set it up because this intermittent behavior may have started at the same time.
>
> -M
>
> > Here are various config dumps. Is there anything else that might be helpful?
> >
> > Thanks for any help!
This has been plaguing me.

> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -ssconf
> > algorithm                     default
> > schedule_interval             0:0:5
> > maxujobs                      200
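The '-q all.q' default mentioned above would normally live in the cluster-wide request file. A quick way to check (paths assume a standard installation layout; adjust SGE_ROOT and the cell name as needed):

  cat $SGE_ROOT/$SGE_CELL/common/sge_request   # global defaults applied to every qsub
  cat ~/.sge_request                           # optional per-user defaults, if present

Removing or commenting out a line like "-q all.q" there lets the scheduler pick any suitable queue instead of pinning every job to all.q, which is one thing to try when chasing "offers 0 slots" messages.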
Re: [gridengine users] PE offers 0 slots?
I have a new insight which is very helpful. Thanks to Mark Bergman, who mentioned that the 'PE offers 0 slots' error/warning can also mean memory limitations.

If the stuck-job problem is happening to a user, I can get jobs to run if I make no memory request, or make a memory request (i.e., -l h_vmem=...) that's less than the default value for the complex. If I request more than 100M greater than the default, the job gets stuck with the "PE offers 0 slots" warning. Interesting! Any thoughts on this? Again, this is happening when there are plenty of resources on the nodes and plenty of room in the users' quotas.

I'll test more tomorrow, but this may mean I can at least get a workaround going by having a large default request and forcing users to make an explicit memory request.

-M

On Tue, Aug 15, 2017 at 6:40 PM, Michael Stauffer wrote:

> ##
>> In regard to the 'int_test' PE you created: if you set the allocation rule to an integer, it means that the job _must_ request an amount of slots equal to or a multiple of this value. In your case the PE is defined to use '8' as the allocation rule, so your job must request 8 or 16 or 24 ... slots. In case you request 2, the job will never start, as the scheduler can't allocate 2 slots with the allocation rule set to 8.
>>
>> From man sge_pe:
>> "If the number of tasks specified with the "-pe" option (see qsub(1)) does not divide without remainder by this number, the job will not be scheduled."
>>
>> So, the fact that the job in int_test never starts if it requests 2 cores is totally fine from the scheduler's point of view.
>
> OK, thanks very much, that explains it. I'll test accordingly.
>
>> ##
>> In regard to this issue in general: just wondering if you, or users on the cluster, use the '-R y' (reservation) option for their jobs? I have seen such behavior when someone submits a job with a reservation defined. The scheduler reserves slots on the cluster for this big job and doesn't let new jobs in (especially in case the runtime is not defined by h_rt). In this case, there will be no messages in the scheduler log, which is sometimes confusing.
>
> I don't think users are using '-R y', but I'm not sure. Do you know how I can tell that? I think 'qstat -g c' shows that in the RES column? I don't think I've ever seen non-zero there, but I'll pay attention. However the stuck-job issue is happening right now to at least one user, and the RES column is all zeros.
>
> -M
>
>> Best regards,
>> Mikhail Serkov
>>
>> On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer wrote:
>>
>>> Hi,
>>>
>>> Below I've dumped relevant configurations.
>>>
>>> Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.
>>>
>>> Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.
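One way to probe the memory behavior described above is to submit the same small job with different explicit requests and watch which ones sit in qw with the "offers 0 slots" note in `qstat -j`. A rough sketch (the script name and sizes are made up; 3100M/3000M are the complex defaults posted earlier in the thread):

  # at or below the complex defaults - expected to run
  qsub -binding linear:2 -pe unihost 2 -l h_vmem=3000M,s_vmem=2900M test_job.sh

  # well above the defaults - the case that reportedly gets stuck
  qsub -binding linear:2 -pe unihost 2 -l h_vmem=8G,s_vmem=7G test_job.sh

  qstat -j <jobid>    # the scheduling info at the bottom shows why the job is not dispatched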
Re: [gridengine users] PE offers 0 slots?
> ##
> In regard to the 'int_test' PE you created: if you set the allocation rule to an integer, it means that the job _must_ request an amount of slots equal to or a multiple of this value. In your case the PE is defined to use '8' as the allocation rule, so your job must request 8 or 16 or 24 ... slots. In case you request 2, the job will never start, as the scheduler can't allocate 2 slots with the allocation rule set to 8.
>
> From man sge_pe:
> "If the number of tasks specified with the "-pe" option (see qsub(1)) does not divide without remainder by this number, the job will not be scheduled."
>
> So, the fact that the job in int_test never starts if it requests 2 cores is totally fine from the scheduler's point of view.

OK, thanks very much, that explains it. I'll test accordingly.

> ##
> In regard to this issue in general: just wondering if you, or users on the cluster, use the '-R y' (reservation) option for their jobs? I have seen such behavior when someone submits a job with a reservation defined. The scheduler reserves slots on the cluster for this big job and doesn't let new jobs in (especially in case the runtime is not defined by h_rt). In this case, there will be no messages in the scheduler log, which is sometimes confusing.

I don't think users are using '-R y', but I'm not sure. Do you know how I can tell that? I think 'qstat -g c' shows that in the RES column? I don't think I've ever seen non-zero there, but I'll pay attention. However the stuck-job issue is happening right now to at least one user, and the RES column is all zeros.

-M

> Best regards,
> Mikhail Serkov
>
> On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer wrote:
>
>> Hi,
>>
>> Below I've dumped relevant configurations.
>>
>> Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.
>>
>> Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.
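On the '-R y' question above: one hint is already in the posted scheduler config, where "max_reservation 0" means the scheduler will not make any resource reservations at all, so '-R y' on a job should have no effect. If reservations were enabled, a way to watch them (a sketch based on the standard scheduler monitoring feature) is:

  qconf -msconf    # set "params MONITOR=1" and raise max_reservation above 0

after which the scheduler appends its dispatch and reservation decisions to $SGE_ROOT/$SGE_CELL/common/schedule, which can be tailed to see which job is holding slots back.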
Re: [gridengine users] PE offers 0 slots?
Hello Michael,

##
In regard to the 'int_test' PE you created: if you set the allocation rule to an integer, it means that the job _must_ request an amount of slots equal to or a multiple of this value. In your case the PE is defined to use '8' as the allocation rule, so your job must request 8 or 16 or 24 ... slots. In case you request 2, the job will never start, as the scheduler can't allocate 2 slots with the allocation rule set to 8.

From man sge_pe:
"If the number of tasks specified with the "-pe" option (see qsub(1)) does not divide without remainder by this number, the job will not be scheduled."

So, the fact that the job in int_test never starts if it requests 2 cores is totally fine from the scheduler's point of view.

##
In regard to this issue in general: just wondering if you, or users on the cluster, use the '-R y' (reservation) option for their jobs? I have seen such behavior when someone submits a job with a reservation defined. The scheduler reserves slots on the cluster for this big job and doesn't let new jobs in (especially in case the runtime is not defined by h_rt). In this case, there will be no messages in the scheduler log, which is sometimes confusing.

Best regards,
Mikhail Serkov

On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer wrote:

> Hi,
>
> Below I've dumped relevant configurations.
>
> Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.
>
> Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.
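To make that concrete with the PEs from this thread (the job script name is just a placeholder):

  qsub -pe int_test 2  job.sh    # never dispatched: 2 is not a multiple of the allocation rule 8
  qsub -pe int_test 8  job.sh    # ok: 8 slots, all placed on one host
  qsub -pe int_test 16 job.sh    # ok: 16 slots, 8 on each of two hosts

  qsub -pe unihost 2 -binding linear:2 job.sh    # unihost uses $pe_slots, so any count that fits on a single host works

With allocation_rule set to a fixed integer N, the scheduler places exactly N slots of the job on each participating host; with $pe_slots it places all requested slots on one host.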
Re: [gridengine users] PE offers 0 slots?
I have some more information. We have two sets of exec hosts on the cluster, one in the host group/hostlist "@allhosts" that is assigned to the queue all.q. The other is in the group "@basichosts", which is assigned to a queue called basic.q.

When we're having the trouble with multi-slot/core jobs not running for a user on all.q, the same jobs can be resubmitted (or added via qalter) to basic.q, and they will run immediately. I made a duplicate queue of all.q, called allalt.q. The same problem happens with jobs getting stuck in queue. When I change the hostlist in allalt.q, and nothing else, from @allhosts to @basichosts, the stuck jobs run immediately. (Again, this is happening when there are plenty of resources reported available on all.q hosts, and the user's quotas are either empty or not maxed.)

Here are the definitions of a host from each of the groups.

A host from all.q's group, @allhosts, where jobs get stuck:

[root@chead ~]# qconf -se compute-0-1
hostname              compute-0-1.local
load_scaling          NONE
complex_values        h_vmem=125.49G,s_vmem=125.49G,slots=16.00
load_values           arch=lx-amd64,num_proc=16,mem_total=64508.523438M, \
                      swap_total=31999.996094M,virtual_total=96508.519531M, \
                      m_topology=SS,m_socket=2,m_core=16, \
                      m_thread=16,load_avg=7.59,load_short=7.66, \
                      load_medium=7.59,load_long=7.30, \
                      mem_free=53815.035156M,swap_free=31834.675781M, \
                      virtual_free=85649.710938M,mem_used=10693.488281M, \
                      swap_used=165.320312M,virtual_used=10858.808594M, \
                      cpu=42.80,m_topology_inuse=SSccCccCCC, \
                      np_load_avg=0.474375,np_load_short=0.478750, \
                      np_load_medium=0.474375,np_load_long=0.456250
processors            16
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

And a host from basic.q's group, @basichosts, where jobs run immediately:

[root@chead ~]# qconf -se compute-1-0
hostname              compute-1-0.local
load_scaling          NONE
complex_values        h_vmem=19.02G,s_vmem=19.02G,slots=8.00
load_values           arch=lx-amd64,num_proc=8,mem_total=16077.441406M, \
                      swap_total=3999.996094M,virtual_total=20077.437500M, \
                      m_topology=SS,m_socket=2,m_core=8,m_thread=8, \
                      load_avg=1.68,load_short=2.42, \
                      load_medium=1.68,load_long=1.79, \
                      mem_free=13408.687500M,swap_free=3973.464844M, \
                      virtual_free=17382.152344M,mem_used=2668.753906M, \
                      swap_used=26.531250M,virtual_used=2695.285156M, \
                      cpu=16.40,m_topology_inuse=SccCCScCCC, \
                      np_load_avg=0.21,np_load_short=0.302500, \
                      np_load_medium=0.21,np_load_long=0.223750
processors            8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

Here's the full complex config. 'slots' is listed as "YES" under consumable, whereas s_vmem and h_vmem are listed as "JOB". Seems this should be OK, but maybe not? Also 'slots' has urgency 1000, whereas others have 0.

[root@chead ~]# qconf -sc
#name               shortcut   type      relop requestable consumable default urgency
#-------------------------------------------------------------------------------------
arch                a          RESTRING  ==    YES         NO         NONE    0
calendar            c          RESTRING  ==    YES         NO         NONE    0
cpu                 cpu        DOUBLE    >=    YES         NO         0       0
display_win_gui     dwg        BOOL      ==    YES         NO         0       0
h_core              h_core     MEMORY    <=    YES         NO         0       0
h_cpu               h_cpu      TIME      <=    YES         NO         0:0:0   0
h_data              h_data     MEMORY    <=    YES         NO         0       0
h_fsize             h_fsize    MEMORY    <=    YES         NO         0       0
h_rss               h_rss      MEMORY    <=    YES         NO         0       0
h_rt                h_rt       TIME      <=    YES         NO         0:0:0   0
h_stack             h_stack    MEMORY    <=    YES         NO         0       0
h_vmem              h_vmem     MEMORY    <=    YES         JOB        3100M   0
hostname            h          HOST      ==    YES         NO         NONE    0
load_avg            la         DOUBLE    >=    NO          NO         0       0
load_long           ll         DOUBLE    >=    NO          NO         0       0
load_medium         lm         DOUBLE    >=    NO          NO         0
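One detail worth keeping in mind when reading those dumps: a consumable marked YES is debited per slot, while one marked JOB is debited once per job. A small worked example with the numbers above (illustrative only):

  compute-1-0: complex_values h_vmem=19.02G, slots=8

  qsub -pe unihost 4 -l h_vmem=6G job.sh
    slots  (consumable YES): 4 x 1  = 4 slots debited  -> 4 of 8 left
    h_vmem (consumable JOB): 1 x 6G = 6G debited       -> about 13G of 19.02G left

If h_vmem were a YES consumable instead, the same job would need 4 x 6G = 24G and could never fit on that host, which is exactly the kind of silent mismatch that shows up as "PE ... offers 0 slots".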
Re: [gridengine users] PE offers 0 slots?
Thanks for the reply Reuti, see below

On Fri, Aug 11, 2017 at 7:18 PM, Reuti wrote:
> What I notice below: defining h_vmem/s_vmem on a queue level means per job. Defining it on an exechost level means across all jobs. What is different between:
>
> > all.q@compute-0-13.local   BP   0/10/16   9.14   lx-amd64
> >    qf:h_vmem=40.000G
> >    qf:s_vmem=40.000G
> >    hc:slots=6
> >
> > all.q@compute-0-14.local   BP   0/10/16   9.66   lx-amd64
> >    hc:h_vmem=28.890G
> >    hc:s_vmem=30.990G
> >    hc:slots=6
>
> qf = queue fixed
> hc = host consumable
>
> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?

I thought this means that when it's showing qf, it's the per-job queue limit, i.e. the queue has h_vmem and s_vmem limits for the job of 40G (which it does). And then hc is shown when the host resources are less than the per-job queue limit.

[root@chead ~]# qconf -sc | grep vmem
h_vmem    h_vmem    MEMORY    <=    YES    JOB    3100M    0
s_vmem    s_vmem    MEMORY    <=    YES    JOB    3000M    0

> > 'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':
> >
> > qsub ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, and there are enough resources available on the queue's nodes per qhost (slots and vmem are consumables), and qquota isn't showing any rqs limits have been reached.
> >
> > Below I've dumped relevant configurations.
> >
> > Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.
> >
> > Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.
> >
> > The default queue from global config file is all.q.
>
> There is no default queue in SGE. One specifies resource requests and SGE will select an appropriate one. What do you refer to by this?
>
> Do you have any sge_request or private .sge_request?

Yes, the global sge_request has '-q all.q'. I can't remember why this was done when I first set things up years ago - I think the cluster I was migrating from was set up that way and I just copied it.

Given my qconf '-ssconf' and '-sconf' output below, does something look off with my fairshare setup (and subsequent attempt to disable it)? As I mentioned, I'm wondering if something went wrong with how I set it up because this intermittent behavior may have started at the same time.

-M

> > Here are various config dumps. Is there anything else that might be helpful?
> >
> > Thanks for any help! This has been plaguing me.
> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -ssconf
> > algorithm                     default
> > schedule_interval             0:0:5
> > maxujobs                      200
> > queue_sort_method             load
> > job_load_adjustments          np_load_avg=0.50
> > load_adjustment_decay_time    0:7:30
> > load_formula                  np_load_avg
> > schedd_job_info               true
> > flush_submit_sec              0
> > flush_finish_sec              0
Re: [gridengine users] PE offers 0 slots?
Hi,

Am 12.08.2017 um 00:41 schrieb Michael Stauffer:
> Hi,
>
> I'm getting back to this post finally. I've looked at the links and suggestions in the two replies to my original post a few months ago, but they haven't helped. Here's my original:
>
> I'm getting some queued jobs with scheduling info that includes this line at the end:
>
> cannot run in PE "unihost" because it only offers 0 slots

What I notice below: defining h_vmem/s_vmem on a queue level means per job. Defining it on an exechost level means across all jobs. What is different between:

> all.q@compute-0-13.local   BP   0/10/16   9.14   lx-amd64
>    qf:h_vmem=40.000G
>    qf:s_vmem=40.000G
>    hc:slots=6
>
> all.q@compute-0-14.local   BP   0/10/16   9.66   lx-amd64
>    hc:h_vmem=28.890G
>    hc:s_vmem=30.990G
>    hc:slots=6

qf = queue fixed
hc = host consumable

What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?

> 'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':
>
> qsub ... -binding linear:2 -pe unihost 2 ...
>
> What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, and there are enough resources available on the queue's nodes per qhost (slots and vmem are consumables), and qquota isn't showing any rqs limits have been reached.
>
> Below I've dumped relevant configurations.
>
> Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.
>
> Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.
>
> The default queue from global config file is all.q.

There is no default queue in SGE. One specifies resource requests and SGE will select an appropriate one. What do you refer to by this?

Do you have any sge_request or private .sge_request?

-- Reuti

> Here are various config dumps. Is there anything else that might be helpful?
>
> Thanks for any help! This has been plaguing me.
> [root@chead ~]# qconf -sp unihost
> pe_name            unihost
> slots
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> [root@chead ~]# qconf -sp int_test
> pe_name            int_test
> slots
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    8
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> [root@chead ~]# qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:5
> maxujobs                          200
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          1
> usage_weight_list                 cpu=0.70,mem=0.20,io=0.10
> compensation_factor               5.00
> weight_user                       0.25
> weight_project                    0.25
> weight_department                 0.25
> weight_job                        0.25
> weight_tickets_functional         1000
> weight_tickets_share              10
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   2000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         100
> halflife_decay_list               none
Re: [gridengine users] PE offers 0 slots?
Hi,

I'm getting back to this post finally. I've looked at the links and suggestions in the two replies to my original post a few months ago, but they haven't helped. Here's my original:

I'm getting some queued jobs with scheduling info that includes this line at the end:

cannot run in PE "unihost" because it only offers 0 slots

'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':

qsub ... -binding linear:2 -pe unihost 2 ...

What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, and there are enough resources available on the queue's nodes per qhost (slots and vmem are consumables), and qquota isn't showing any rqs limits have been reached.

Below I've dumped relevant configurations.

Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run to this new PE however, it *always* fails with the same "PE ...offers 0 slots" error, even if I can run the same multi-slot job using "unihost" PE at the same time. I'm not sure if this helps debug or not.

Another thought - this behavior started happening some time ago more or less when I tried implementing fairshare behavior. I never seemed to get fairshare working right. We haven't been able to confirm, but for some users it seems this "PE 0 slots" issue pops up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior.

The default queue from global config file is all.q.

Here are various config dumps. Is there anything else that might be helpful?

Thanks for any help! This has been plaguing me.
[root@chead ~]# qconf -sp unihost
pe_name            unihost
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

[root@chead ~]# qconf -sp int_test
pe_name            int_test
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

[root@chead ~]# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:5
maxujobs                          200
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          1
usage_weight_list                 cpu=0.70,mem=0.20,io=0.10
compensation_factor               5.00
weight_user                       0.25
weight_project                    0.25
weight_department                 0.25
weight_job                        0.25
weight_tickets_functional         1000
weight_tickets_share              10
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   2000
report_pjob_tickets               TRUE
max_pending_tasks_per_job         100
halflife_decay_list               none
policy_hierarchy                  OS
weight_ticket                     0.00
weight_waiting_time               1.00
weight_deadline                   360.00
weight_urgency                    0.10
weight_priority                   1.00
max_reservation                   0
default_duration                  INFINITY

[root@chead ~]# qconf -sconf
#global:
execd_spool_dir              /opt/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,bash,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           02:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
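Regarding the "did I break fairshare" question: in the scheduler config above, weight_ticket is 0.00. Grid Engine's dispatch priority is roughly (see sge_priority(5)):

  prio = weight_priority * normalized POSIX priority
       + weight_urgency  * normalized urgency (resource, waiting-time and deadline terms)
       + weight_ticket   * normalized tickets (share tree / functional / override)

so with weight_ticket at 0.00 the ticket policies, and therefore fairshare, should have no influence on job order at all. That makes it unlikely that a misconfigured share tree is what blocks the parallel jobs; it points back at resource and quota accounting instead.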
Re: [gridengine users] PE offers 0 slots?
On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:

SoGE 8.1.8

Hi,

I'm getting some queued jobs with scheduling info that includes this line at the end:

cannot run in PE "unihost" because it only offers 0 slots

'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':

... -binding linear:2 -pe unihost 2 ...

What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, and there are enough resources available on the queue's nodes (slots and vmem are consumables).

Any suggestions about how I might further understand this?

This *exact* problem has bitten me in the past. It seems to crop up about every 3 years--long enough to remember it was a problem, and long enough to forget just what the [censored] I did to fix it.

As I recall, it has little to do with actual PEs, but everything to do with complexes and resource requests. You might glean a bit more information by running "qsub -w p" (or "-w e").

Take a look at these previous discussions:

http://gridengine.org/pipermail/users/2011-November/001932.html
http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700

-- Jesse Becker (Contractor)
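For reference, the validation switch mentioned above can be used like this (the job script and requests are placeholders; see qsub(1) for the exact semantics of each level):

  qsub -w p -binding linear:2 -pe unihost 2 -l h_vmem=4G job.sh   # validate against the cluster as it is now
  qsub -w v -binding linear:2 -pe unihost 2 -l h_vmem=4G job.sh   # validate against an idealized empty cluster

Both should print a scheduler-style report of which queues and PEs could (or could not) take the job, which is often more informative than waiting for the "offers 0 slots" line in qstat -j.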
Re: [gridengine users] PE offers 0 slots?
On Mon, Feb 13, 2017 at 2:32 PM, Luis Huang wrote:

> Check to make sure you haven’t got any rqs interfering.

I don't see any rqs as interfering. qquota for the users in question returns that they have quota available on queues to which their jobs are submitted. And qstat on the queue shows available resources.

> I just had the exact same problem and it turns out that RQS was limiting it.
>
> Also check your qconf -spl to make sure your PE has got enough slots.

The PE is assigned slots, and the cluster has 500 total.

Thanks for the reply.

-M
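A quick way to double-check the RQS angle (the user name is a placeholder):

  qconf -srqsl            # list all resource quota sets
  qconf -srqs             # dump every rule set to review the limits
  qquota -u <username>    # show which rules currently apply to that user and how full they are

If a rule matches but its usage column is below the limit, RQS is not what is holding the job; a rule that silently caps slots per PE or per host is the usual culprit when it is.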
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
Am 24.01.2013 um 18:54 schrieb Dave Love:

[Excuse any duplicates -- I'm not sure if gridengine.org is tits-up again as well as our mail hub sulking at my laptop.]

Reuti re...@staff.uni-marburg.de writes:

I think that's an old version. Suggestions are welcome for any improvements to the current one, which I tried to tidy up (from which http://arc.liv.ac.uk/SGE/htmlman/htmlman5/queue_conf.html is derived).

Aha, I see. It's now at the beginning of the man page. But shouldn't the outer brackets be bold instead of the inner ones? The outer ones are the meta-symbols.

I don't think so. Bold is meant to be literal, as in the SYNOPSIS. I thought that was the closest to a proper convention; is that wrong or confusing? (I hope not after trying to get the markup straight in a lot of places!)

Okay, I see.

-- Reuti

I've been tempted to use mdoc, but it's substantial work to convert, and probably means re-introducing catman.

--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
[Excuse any duplicates -- I'm not sure if gridengine.org is tits-up again as well as our mail hub sulking at my laptop.]

Reuti re...@staff.uni-marburg.de writes:

I think that's an old version. Suggestions are welcome for any improvements to the current one, which I tried to tidy up (from which http://arc.liv.ac.uk/SGE/htmlman/htmlman5/queue_conf.html is derived).

Aha, I see. It's now at the beginning of the man page. But shouldn't the outer brackets be bold instead of the inner ones? The outer ones are the meta-symbols.

I don't think so. Bold is meant to be literal, as in the SYNOPSIS. I thought that was the closest to a proper convention; is that wrong or confusing? (I hope not after trying to get the markup straight in a lot of places!)

I've been tempted to use mdoc, but it's substantial work to convert, and probably means re-introducing catman.

--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
Am 18.01.2013 um 17:24 schrieb Dave Love:

Reuti re...@staff.uni-marburg.de writes:

It's not limited to a PE list entry, but applies to all. It is explained at the beginning of `man queue_conf` under hostlist. Although it's hard to read due to the bracket being a meta symbol and a character to be typed.

I think that's an old version. Suggestions are welcome for any improvements to the current one, which I tried to tidy up (from which http://arc.liv.ac.uk/SGE/htmlman/htmlman5/queue_conf.html is derived).

Aha, I see. It's now at the beginning of the man page. But shouldn't the outer brackets be bold instead of the inner ones? The outer ones are the meta-symbols.

-- Reuti
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
Am 12.01.2013 um 01:04 schrieb berg...@merctech.com:

Where is the syntax for the pe_list parameter documented? I looked for an explanation, but didn't find details or examples in the man pages. There were some previous discussions on the mailing list (mostly from you), but they don't provide a general syntax, just specific answers.

It's not limited to a PE list entry, but applies to all. It is explained at the beginning of `man queue_conf` under hostlist. Although it's hard to read due to the bracket being a meta symbol and a character to be typed.

The parameters_specifier_syntax there is the one which you would also use for the default list, and the complete list of parameters needs to be entered there for specific hosts/groups as it will override the default list.

-- Reuti
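As an illustration of that parameters_specifier_syntax from queue_conf(5) (the hostgroup, host and PE names here are made up):

  slots     16,[@bighosts=32],[node17.local=8]
  pe_list   make smp,[@mpihosts=make smp openmpi]

The unbracketed value is the default for all hosts in the queue's hostlist; each [host_or_group=...] entry replaces the whole list for the hosts it names, which is why a PE left out of an override disappears on exactly those hosts.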
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
Am 11.01.2013 um 23:16 schrieb berg...@merctech.com:

I recently reconfigured our SGE (6.2u5) environment to better handle MPI jobs on a heterogeneous cluster. This seems to have caused a problem with the threaded (SMP) PE. Our PEs are:

qconf -spl
make (unused)
openmpi-AMD
openmpi-Intel
threaded

I'm using a JSV to allow users to request -pe openmpi and alter that to -pe openmpi-*. The two openmpi-* PEs are both assigned to the all.q, but only given a hostgroup with the appropriate servers. This works fine for OpenMPI jobs. The PE threaded is also assigned to the all.q. That PE should consist of all hosts in the queue.

qconf -sq all.q | grep pe_list
pe_list threaded make,[@mpi-AMD=openmpi-AMD],[@mpi-Intel=openmpi-Intel]

pe_list make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel threaded]

should do it - Reuti

However, jobs submitted with a request for -pe threaded are not run. SGE claims that the PE is not assigned to any queue:

qstat -j 5170487
parallel environment: threaded range: 4
cannot run in queue all.q@c5-10 because PE threaded is not in pe list
cannot run in queue all.q@c5-11 because PE threaded is not in pe list
cannot run in queue all.q@c5-12 because PE threaded is not in pe list

I've tried assigning a hostgroup (@batch, the same as the hostgroup assigned to the all.q) to the threaded PE, but that puts the nodes into the c(onfiguration ambiguous) state.

Any suggestions?

Thanks,

Mark
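After changing the queue as suggested, a couple of commands can confirm the PE is visible everywhere (queue and PE names taken from this thread):

  qconf -mq all.q                  # edit pe_list to: make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel threaded]
  qconf -sq all.q | grep pe_list   # confirm the stored value
  qselect -pe threaded             # should now list every all.q queue instance, assuming each host is in one of the two hostgroups

qselect -pe prints the queue instances offering that PE, so an empty or short result immediately shows which hostgroup override is still missing it.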
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
In the message dated: Fri, 11 Jan 2013 23:45:05 +0100,
The pithy ruminations from Reuti on Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?) were:

= Am 11.01.2013 um 23:16 schrieb berg...@merctech.com:
=
= [SNIP!]
=
= qconf -sq all.q | grep pe_list
= pe_list threaded make,[@mpi-AMD=openmpi-AMD],[@mpi-Intel=openmpi-Intel]
=
= pe_list make,[@mpi-AMD=openmpi-AMD threaded],[@mpi-Intel=openmpi-Intel threaded]
=
= should do it - Reuti
=

Yes, that fixed the problem. Thank you very much for the prompt accurate answer.

Where is the syntax for the pe_list parameter documented? I looked for an explanation, but didn't find details or examples in the man pages. There were some previous discussions on the mailing list (mostly from you), but they don't provide a general syntax, just specific answers.

Thanks again,

Mark