Hi,

On 12.08.2017 at 00:41, Michael Stauffer wrote:
> Hi,
>
> I'm getting back to this post finally. I've looked at the links and
> suggestions in the two replies to my original post a few months ago, but they
> haven't helped. Here's my original:
>
> I'm getting some queued jobs with scheduling info that includes this line at
> the end:
>
> cannot run in PE "unihost" because it only offers 0 slots

What I notice below: defining h_vmem/s_vmem on the queue level means per job; defining it on the exechost level means across all jobs. What is different between:

> ---------------------------------------------------------------------------------
> all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>         hc:h_vmem=28.890G
>         hc:s_vmem=30.990G
>         hc:slots=6

qf = queue fixed
hc = host consumable

What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?

> 'unihost' is the only PE I use. When users request multiple slots, they use
> 'unihost':
>
> qsub ... -binding linear:2 -pe unihost 2 ...
>
> What happens is that these jobs aren't running when it otherwise seems like
> they should be, or they sit waiting in the queue for a long time even when
> the user has plenty of quota available within the queue they've requested,
> and there are enough resources available on the queue's nodes per qhost (slots
> and vmem are consumables), and qquota isn't showing any rqs limits have been
> reached.
>
> Below I've dumped relevant configurations.
>
> Today I created a new PE called "int_test" to test the "integer" allocation
> rule. I set it to 16 (16 cores per node), and have also tried 8. It's been
> added as a PE to the queues we use. When I try to run on this new PE however,
> it *always* fails with the same "PE ... offers 0 slots" error, even if I can
> run the same multi-slot job using the "unihost" PE at the same time. I'm not
> sure if this helps debug or not.
>
> Another thought - this behavior started happening some time ago, more or less
> when I tried implementing fairshare behavior. I never seemed to get fairshare
> working right. We haven't been able to confirm, but for some users it seems
> this "PE 0 slots" issue pops up only after they've been running other jobs
> for a little while. So I'm wondering if I've screwed up fairshare in some way
> that's causing this odd behavior.
>
> The default queue from global config file is all.q.

There is no default queue in SGE. One specifies resource requests and SGE will select an appropriate one. What do you refer to by this? Do you have any sge_request or private .sge_request?

-- Reuti

> Here are various config dumps. Is there anything else that might be helpful?
>
> Thanks for any help! This has been plaguing me.
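To make the qf/hc distinction above concrete, a minimal sketch with a made-up value (the 64G is only an illustration, not taken from your cluster):

   # queue level: a fixed per-job limit, reported as qf: by qstat -F
   qconf -mattr queue h_vmem 40G all.q

   # exechost level: a consumable shared by all jobs on the host,
   # reported as hc: by qstat -F
   qconf -mattr exechost complex_values h_vmem=64G compute-0-14.local

For the hc: values to be tracked at all, h_vmem/s_vmem have to be defined as consumables; the relevant lines of `qconf -sc` should look roughly like this:

   #name    shortcut   type    relop requestable consumable default  urgency
   h_vmem   h_vmem     MEMORY  <=    YES         YES        0        0
   s_vmem   s_vmem     MEMORY  <=    YES         YES        0        0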
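Regarding the default requests: assuming the standard $SGE_ROOT layout with the "default" cell, the places to check would be:

   # cluster-wide default requests
   cat $SGE_ROOT/default/common/sge_request

   # per-user default requests, if present
   cat ~/.sge_request

A "-q all.q" in one of these files would explain the impression of a "default queue".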
> [root@chead ~]# qconf -sp unihost
> pe_name            unihost
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> [root@chead ~]# qconf -sp int_test
> pe_name            int_test
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    8
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> [root@chead ~]# qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:5
> maxujobs                          200
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          1
> usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         1000
> weight_tickets_share              100000
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   2000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         100
> halflife_decay_list               none
> policy_hierarchy                  OS
> weight_ticket                     0.000000
> weight_waiting_time               1.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   0
> default_duration                  INFINITY
>
> [root@chead ~]# qconf -sconf
> #global:
> execd_spool_dir              /opt/sge/default/spool
> mailer                       /bin/mail
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,bash,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           02:00:00
> loglevel                     log_warning
> administrator_mail           none
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 ENABLE_BINDING=true
> reporting_params             accounting=true reporting=true \
>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-20100
> qlogin_command               /opt/sge/bin/cfn-qlogin.sh
> qlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   4000
> max_jobs                     0
> max_advance_reservations     0
> auto_user_oticket            0
> auto_user_fshare             100
> auto_user_default_project    none
> auto_user_delete_time        0
> delegated_file_staging       false
> reprioritize                 0
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> [root@chead ~]# qconf -sq all.q
> qname                 all.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH
> ckpt_list             NONE
> pe_list               make mpich mpi orte unihost serial int_test unihost2
> rerun                 FALSE
> slots                 1,[compute-0-0.local=4],[compute-0-1.local=15], \
>                       [compute-0-2.local=15],[compute-0-3.local=15], \
>                       [compute-0-4.local=15],[compute-0-5.local=15], \
>                       [compute-0-6.local=16],[compute-0-7.local=16], \
>                       [compute-0-9.local=16],[compute-0-10.local=16], \
>                       [compute-0-11.local=16],[compute-0-12.local=16], \
>                       [compute-0-13.local=16],[compute-0-14.local=16], \
>                       [compute-0-15.local=16],[compute-0-16.local=16], \
>                       [compute-0-17.local=16],[compute-0-18.local=16], \
>                       [compute-0-8.local=16],[compute-0-19.local=14], \
>                       [compute-0-20.local=4],[compute-gpu-0.local=4]
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                40G,[compute-0-20.local=3.2G], \
>                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
> h_vmem                40G,[compute-0-20.local=3.2G], \
>                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
>
> qstat -j on a stuck job as an example:
>
> [mgstauff@chead ~]$ qstat -j 3714924
> ==============================================================
> job_number:                 3714924
> exec_file:                  job_scripts/3714924
> submission_time:            Fri Aug 11 12:48:47 2017
> owner:                      mgstauff
> uid:                        2198
> group:                      mgstauff
> gid:                        2198
> sge_o_home:                 /home/mgstauff
> sge_o_log_name:             mgstauff
> sge_o_path:                 /share/apps/mricron/ver_2015_06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /home/mgstauff
> sge_o_host:                 chead
> account:                    sge
> hard resource_list:         h_stack=128m
> mail_list:                  mgstauff@chead.local
> notify:                     FALSE
> job_name:                   myjobparam
> jobshare:                   0
> hard_queue_list:            all.q
> env_list:                   TERM=NONE
> job_args:                   5
> script_file:                workshop-files/myjobparam
> parallel environment:       int_test range: 2
> binding:                    set linear:2
> job_type:                   NONE
> scheduling info:            queue instance "gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
>                             queue instance "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-18.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-17.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-16.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-13.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-15.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-14.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-12.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-11.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-10.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-9.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-5.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-6.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-7.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-8.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-4.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-2.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-1.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-0.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-20.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-19.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-3.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-gpu-0.local" dropped because it is temporarily not available
>                             queue instance "qlogin.long.q@compute-0-20.local" dropped because it is full
>                             queue instance "qlogin.long.q@compute-0-19.local" dropped because it is full
>                             queue instance "qlogin.long.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "basic.q@compute-1-2.local" dropped because it is full
>                             queue instance "himem.q@compute-0-13.local" dropped because it is full
>                             queue instance "himem.q@compute-0-4.local" dropped because it is full
>                             queue instance "himem.q@compute-0-2.local" dropped because it is full
>                             queue instance "himem.q@compute-0-12.local" dropped because it is full
>                             queue instance "himem.q@compute-0-17.local" dropped because it is full
>                             queue instance "himem.q@compute-0-3.local" dropped because it is full
>                             queue instance "himem.q@compute-0-8.local" dropped because it is full
>                             queue instance "himem.q@compute-0-5.local" dropped because it is full
>                             queue instance "himem.q@compute-0-11.local" dropped because it is full
>                             queue instance "himem.q@compute-0-15.local" dropped because it is full
>                             queue instance "himem.q@compute-0-7.local" dropped because it is full
>                             queue instance "himem.q@compute-0-14.local" dropped because it is full
>                             queue instance "himem.q@compute-0-18.local" dropped because it is full
>                             queue instance "himem.q@compute-0-10.local" dropped because it is full
>                             queue instance "himem.q@compute-0-6.local" dropped because it is full
>                             queue instance "himem.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "himem.q@compute-0-16.local" dropped because it is full
>                             queue instance "himem.q@compute-0-9.local" dropped because it is full
>                             queue instance "himem.q@compute-0-0.local" dropped because it is full
>                             queue instance "himem.q@compute-0-1.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-13.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-4.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-2.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-12.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-17.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-3.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-8.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-5.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-11.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-15.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-7.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-14.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-18.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-10.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-6.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-16.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-9.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-0.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-1.local" dropped because it is full
>                             queue instance "qlogin.q@compute-0-20.local" dropped because it is full
>                             queue instance "qlogin.q@compute-0-19.local" dropped because it is full
>                             queue instance "qlogin.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "qlogin.q@compute-0-7.local" dropped because it is full
>                             queue instance "all.q@compute-0-0.local" dropped because it is full
>                             cannot run in PE "int_test" because it only offers 0 slots
> [mgstauff@chead ~]$ qquota -u mgstauff
> resource quota rule limit                filter
> --------------------------------------------------------------------------------
>
> [mgstauff@chead ~]$ qconf -srqs limit_user_slots
> {
>    name         limit_user_slots
>    description  Limit the users' batch slots
>    enabled      TRUE
>    limit        users {pcook,mgstauff} queues {allalt.q} to slots=32
>    limit        users {*} queues {allalt.q} to slots=0
>    limit        users {*} queues {himem.q} to slots=6
>    limit        users {*} queues {all.q,himem.q} to slots=32
>    limit        users {*} queues {basic.q} to slots=40
> }
>
> There are plenty of consumables available:
>
> [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-0-0.local        BP    0/4/4          5.24     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=0
> ---------------------------------------------------------------------------------
> all.q@compute-0-1.local        BP    0/10/15        9.58     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-10.local       BP    0/9/16         9.80     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=7
> ---------------------------------------------------------------------------------
> all.q@compute-0-11.local       BP    0/11/16        9.18     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-12.local       BP    0/11/16        9.72     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>         hc:h_vmem=28.890G
>         hc:s_vmem=30.990G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-15.local       BP    0/10/16        9.54     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-16.local       BP    0/10/16        10.01    lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-17.local       BP    0/11/16        9.75     lx-amd64
>         hc:h_vmem=29.963G
>         hc:s_vmem=32.960G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-18.local       BP    0/11/16        10.29    lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-19.local       BP    0/9/14         9.01     lx-amd64
>         qf:h_vmem=5.000G
>         qf:s_vmem=5.000G
>         qc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-2.local        BP    0/10/15        9.24     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-20.local       BP    0/0/4          0.00     lx-amd64
>         qf:h_vmem=3.200G
>         qf:s_vmem=3.200G
>         qc:slots=4
> ---------------------------------------------------------------------------------
> all.q@compute-0-3.local        BP    0/11/15        9.62     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=4
> ---------------------------------------------------------------------------------
> all.q@compute-0-4.local        BP    0/12/15        9.85     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=3
> ---------------------------------------------------------------------------------
> all.q@compute-0-5.local        BP    0/12/15        10.18    lx-amd64
>         hc:h_vmem=36.490G
>         hc:s_vmem=39.390G
>         qc:slots=3
> ---------------------------------------------------------------------------------
> all.q@compute-0-6.local        BP    0/12/16        9.95     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=4
> ---------------------------------------------------------------------------------
> all.q@compute-0-7.local        BP    0/10/16        9.59     lx-amd64
>         hc:h_vmem=36.935G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-8.local        BP    0/10/16        9.37     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-9.local        BP    0/10/16        9.38     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-gpu-0.local      BP    0/0/4          0.05     lx-amd64
>         qf:h_vmem=3.200G
>         qf:s_vmem=3.200G
>         qc:slots=4
>
>
> On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> wrote:
> On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
> SoGE 8.1.8
>
> Hi,
>
> I'm getting some queued jobs with scheduling info that includes this line
> at the end:
>
> cannot run in PE "unihost" because it only offers 0 slots
>
> 'unihost' is the only PE I use. When users request multiple slots, they use
> 'unihost':
>
> ... -binding linear:2 -pe unihost 2 ...
>
> What happens is that these jobs aren't running when it otherwise seems like
> they should be, or they sit waiting in the queue for a long time even when
> the user has plenty of quota available within the queue they've requested,
> and there are enough resources available on the queue's nodes (slots and
> vmem are consumables).
>
> Any suggestions about how I might further understand this?
>
> This *exact* problem has bitten me in the past. It seems to crop up
> about every 3 years--long enough to remember it was a problem, and long
> enough to forget just what the [censored] I did to fix it.
>
> As I recall, it has little to do with actual PEs, but everything to do
> with complexes and resource requests.
>
> You might glean a bit more information by running "qsub -w p" (or "-w e").
>
> Take a look at these previous discussions:
>
> http://gridengine.org/pipermail/users/2011-November/001932.html
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
>
> --
> Jesse Becker (Contractor)

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users