Hi,

On 12.08.2017 at 00:41, Michael Stauffer wrote:
> Hi,
>
> I'm getting back to this post finally. I've looked at the links and
> suggestions in the two replies to my original post a few months ago, but they
> haven't helped. Here's my original:
>
> I'm getting some queued jobs with scheduling info that includes this line at
> the end:
>
> cannot run in PE "unihost" because it only offers 0 slots

What I notice below: defining h_vmem/s_vmem on the queue level means per job; defining it on the exechost level means across all jobs. What is different between:

> ---------------------------------------------------------------------------------
> all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>         hc:h_vmem=28.890G
>         hc:s_vmem=30.990G
>         hc:slots=6

qf = queue fixed
hc = host consumable

What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?

> 'unihost' is the only PE I use. When users request multiple slots, they use
> 'unihost':
>
> qsub ... -binding linear:2 -pe unihost 2 ...
>
> What happens is that these jobs aren't running when it otherwise seems like
> they should be, or they sit waiting in the queue for a long time even when
> the user has plenty of quota available within the queue they've requested,
> and there are enough resources available on the queue's nodes per qhost (slots
> and vmem are consumables), and qquota isn't showing any rqs limits have been
> reached.
>
> Below I've dumped relevant configurations.
>
> Today I created a new PE called "int_test" to test the "integer" allocation
> rule. I set it to 16 (16 cores per node), and have also tried 8. It's been
> added as a PE to the queues we use. When I try to run on this new PE however,
> it *always* fails with the same "PE ... offers 0 slots" error, even if I can
> run the same multi-slot job using the "unihost" PE at the same time. I'm not
> sure if this helps debug or not.
>
> Another thought - this behavior started happening some time ago, more or less
> when I tried implementing fairshare behavior. I never seemed to get fairshare
> working right. We haven't been able to confirm, but for some users it seems
> this "PE 0 slots" issue pops up only after they've been running other jobs
> for a little while. So I'm wondering if I've screwed up fairshare in some way
> that's causing this odd behavior.
>
> The default queue from global config file is all.q.

There is no default queue in SGE. One specifies resource requests and SGE will select an appropriate one. What do you refer to by this? Do you have any sge_request or private .sge_request?

-- Reuti

> Here are various config dumps. Is there anything else that might be helpful?
>
> Thanks for any help! This has been plaguing me.
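To make the qf/hc distinction above concrete, a minimal sketch with a made-up value (the 64G is only an illustration, not taken from your cluster):

   # queue level: a fixed per-job limit, reported as qf: by qstat -F
   qconf -mattr queue h_vmem 40G all.q

   # exechost level: a consumable shared by all jobs on the host,
   # reported as hc: by qstat -F
   qconf -mattr exechost complex_values h_vmem=64G compute-0-14.local

For the hc: values to be tracked at all, h_vmem/s_vmem have to be defined as consumables; the relevant lines of `qconf -sc` should look roughly like this:

   #name    shortcut   type    relop requestable consumable default  urgency
   h_vmem   h_vmem     MEMORY  <=    YES         YES        0        0
   s_vmem   s_vmem     MEMORY  <=    YES         YES        0        0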
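Regarding the default requests: assuming the standard $SGE_ROOT layout with the "default" cell, the places to check would be:

   # cluster-wide default requests
   cat $SGE_ROOT/default/common/sge_request

   # per-user default requests, if present
   cat ~/.sge_request

A "-q all.q" in one of these files would explain the impression of a "default queue".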
> [root@chead ~]# qconf -sp unihost
> pe_name            unihost
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> [root@chead ~]# qconf -sp int_test
> pe_name            int_test
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    8
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> [root@chead ~]# qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:5
> maxujobs                          200
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          1
> usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         1000
> weight_tickets_share              100000
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   2000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         100
> halflife_decay_list               none
> policy_hierarchy                  OS
> weight_ticket                     0.000000
> weight_waiting_time               1.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   0
> default_duration                  INFINITY
>
> [root@chead ~]# qconf -sconf
> #global:
> execd_spool_dir              /opt/sge/default/spool
> mailer                       /bin/mail
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,bash,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           02:00:00
> loglevel                     log_warning
> administrator_mail           none
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 ENABLE_BINDING=true
> reporting_params             accounting=true reporting=true \
>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-20100
> qlogin_command               /opt/sge/bin/cfn-qlogin.sh
> qlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   4000
> max_jobs                     0
> max_advance_reservations     0
> auto_user_oticket            0
> auto_user_fshare             100
> auto_user_default_project    none
> auto_user_delete_time        0
> delegated_file_staging       false
> reprioritize                 0
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> [root@chead ~]# qconf -sq all.q
> qname                 all.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH
> ckpt_list             NONE
> pe_list               make mpich mpi orte unihost serial int_test unihost2
> rerun                 FALSE
> slots                 1,[compute-0-0.local=4],[compute-0-1.local=15], \
>                       [compute-0-2.local=15],[compute-0-3.local=15], \
>                       [compute-0-4.local=15],[compute-0-5.local=15], \
>                       [compute-0-6.local=16],[compute-0-7.local=16], \
>                       [compute-0-9.local=16],[compute-0-10.local=16], \
>                       [compute-0-11.local=16],[compute-0-12.local=16], \
>                       [compute-0-13.local=16],[compute-0-14.local=16], \
>                       [compute-0-15.local=16],[compute-0-16.local=16], \
>                       [compute-0-17.local=16],[compute-0-18.local=16], \
>                       [compute-0-8.local=16],[compute-0-19.local=14], \
>                       [compute-0-20.local=4],[compute-gpu-0.local=4]
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                40G,[compute-0-20.local=3.2G], \
>                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
> h_vmem                40G,[compute-0-20.local=3.2G], \
>                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
>
> qstat -j on a stuck job as an example:
>
> [mgstauff@chead ~]$ qstat -j 3714924
> ==============================================================
> job_number:                 3714924
> exec_file:                  job_scripts/3714924
> submission_time:            Fri Aug 11 12:48:47 2017
> owner:                      mgstauff
> uid:                        2198
> group:                      mgstauff
> gid:                        2198
> sge_o_home:                 /home/mgstauff
> sge_o_log_name:             mgstauff
> sge_o_path:                 /share/apps/mricron/ver_2015_06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /home/mgstauff
> sge_o_host:                 chead
> account:                    sge
> hard resource_list:         h_stack=128m
> mail_list:                  mgstauff@chead.local
> notify:                     FALSE
> job_name:                   myjobparam
> jobshare:                   0
> hard_queue_list:            all.q
> env_list:                   TERM=NONE
> job_args:                   5
> script_file:                workshop-files/myjobparam
> parallel environment:       int_test range: 2
> binding:                    set linear:2
> job_type:                   NONE
> scheduling info:            queue instance "gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
>                             queue instance "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-18.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-17.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-16.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-13.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-15.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-14.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-12.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-11.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-10.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-9.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-5.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-6.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-7.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-8.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-4.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-2.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-1.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-0.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-20.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-19.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-0-3.local" dropped because it is temporarily not available
>                             queue instance "reboot.q@compute-gpu-0.local" dropped because it is temporarily not available
>                             queue instance "qlogin.long.q@compute-0-20.local" dropped because it is full
>                             queue instance "qlogin.long.q@compute-0-19.local" dropped because it is full
>                             queue instance "qlogin.long.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "basic.q@compute-1-2.local" dropped because it is full
>                             queue instance "himem.q@compute-0-13.local" dropped because it is full
>                             queue instance "himem.q@compute-0-4.local" dropped because it is full
>                             queue instance "himem.q@compute-0-2.local" dropped because it is full
>                             queue instance "himem.q@compute-0-12.local" dropped because it is full
>                             queue instance "himem.q@compute-0-17.local" dropped because it is full
>                             queue instance "himem.q@compute-0-3.local" dropped because it is full
>                             queue instance "himem.q@compute-0-8.local" dropped because it is full
>                             queue instance "himem.q@compute-0-5.local" dropped because it is full
>                             queue instance "himem.q@compute-0-11.local" dropped because it is full
>                             queue instance "himem.q@compute-0-15.local" dropped because it is full
>                             queue instance "himem.q@compute-0-7.local" dropped because it is full
>                             queue instance "himem.q@compute-0-14.local" dropped because it is full
>                             queue instance "himem.q@compute-0-18.local" dropped because it is full
>                             queue instance "himem.q@compute-0-10.local" dropped because it is full
>                             queue instance "himem.q@compute-0-6.local" dropped because it is full
>                             queue instance "himem.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "himem.q@compute-0-16.local" dropped because it is full
>                             queue instance "himem.q@compute-0-9.local" dropped because it is full
>                             queue instance "himem.q@compute-0-0.local" dropped because it is full
>                             queue instance "himem.q@compute-0-1.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-13.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-4.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-2.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-12.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-17.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-3.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-8.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-5.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-11.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-15.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-7.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-14.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-18.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-10.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-6.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-16.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-9.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-0.local" dropped because it is full
>                             queue instance "qlogin.himem.q@compute-0-1.local" dropped because it is full
>                             queue instance "qlogin.q@compute-0-20.local" dropped because it is full
>                             queue instance "qlogin.q@compute-0-19.local" dropped because it is full
>                             queue instance "qlogin.q@compute-gpu-0.local" dropped because it is full
>                             queue instance "qlogin.q@compute-0-7.local" dropped because it is full
>                             queue instance "all.q@compute-0-0.local" dropped because it is full
>                             cannot run in PE "int_test" because it only offers 0 slots
> [mgstauff@chead ~]$ qquota -u mgstauff
> resource quota rule limit                filter
> --------------------------------------------------------------------------------
>
> [mgstauff@chead ~]$ qconf -srqs limit_user_slots
> {
>    name         limit_user_slots
>    description  Limit the users' batch slots
>    enabled      TRUE
>    limit        users {pcook,mgstauff} queues {allalt.q} to slots=32
>    limit        users {*} queues {allalt.q} to slots=0
>    limit        users {*} queues {himem.q} to slots=6
>    limit        users {*} queues {all.q,himem.q} to slots=32
>    limit        users {*} queues {basic.q} to slots=40
> }
>
> There are plenty of consumables available:
>
> [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-0-0.local        BP    0/4/4          5.24     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=0
> ---------------------------------------------------------------------------------
> all.q@compute-0-1.local        BP    0/10/15        9.58     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-10.local       BP    0/9/16         9.80     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=7
> ---------------------------------------------------------------------------------
> all.q@compute-0-11.local       BP    0/11/16        9.18     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-12.local       BP    0/11/16        9.72     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>         hc:h_vmem=28.890G
>         hc:s_vmem=30.990G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-15.local       BP    0/10/16        9.54     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-16.local       BP    0/10/16        10.01    lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-17.local       BP    0/11/16        9.75     lx-amd64
>         hc:h_vmem=29.963G
>         hc:s_vmem=32.960G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-18.local       BP    0/11/16        10.29    lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-19.local       BP    0/9/14         9.01     lx-amd64
>         qf:h_vmem=5.000G
>         qf:s_vmem=5.000G
>         qc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-2.local        BP    0/10/15        9.24     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-20.local       BP    0/0/4          0.00     lx-amd64
>         qf:h_vmem=3.200G
>         qf:s_vmem=3.200G
>         qc:slots=4
> ---------------------------------------------------------------------------------
> all.q@compute-0-3.local        BP    0/11/15        9.62     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=4
> ---------------------------------------------------------------------------------
> all.q@compute-0-4.local        BP    0/12/15        9.85     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         qc:slots=3
> ---------------------------------------------------------------------------------
> all.q@compute-0-5.local        BP    0/12/15        10.18    lx-amd64
>         hc:h_vmem=36.490G
>         hc:s_vmem=39.390G
>         qc:slots=3
> ---------------------------------------------------------------------------------
> all.q@compute-0-6.local        BP    0/12/16        9.95     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=4
> ---------------------------------------------------------------------------------
> all.q@compute-0-7.local        BP    0/10/16        9.59     lx-amd64
>         hc:h_vmem=36.935G
>         qf:s_vmem=40.000G
>         hc:slots=5
> ---------------------------------------------------------------------------------
> all.q@compute-0-8.local        BP    0/10/16        9.37     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-0-9.local        BP    0/10/16        9.38     lx-amd64
>         qf:h_vmem=40.000G
>         qf:s_vmem=40.000G
>         hc:slots=6
> ---------------------------------------------------------------------------------
> all.q@compute-gpu-0.local      BP    0/0/4          0.05     lx-amd64
>         qf:h_vmem=3.200G
>         qf:s_vmem=3.200G
>         qc:slots=4
>
>
> On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> wrote:
> On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
> SoGE 8.1.8
>
> Hi,
>
> I'm getting some queued jobs with scheduling info that includes this line
> at the end:
>
> cannot run in PE "unihost" because it only offers 0 slots
>
> 'unihost' is the only PE I use. When users request multiple slots, they use
> 'unihost':
>
> ... -binding linear:2 -pe unihost 2 ...
>
> What happens is that these jobs aren't running when it otherwise seems like
> they should be, or they sit waiting in the queue for a long time even when
> the user has plenty of quota available within the queue they've requested,
> and there are enough resources available on the queue's nodes (slots and
> vmem are consumables).
>
> Any suggestions about how I might further understand this?
>
> This *exact* problem has bitten me in the past. It seems to crop up
> about every 3 years--long enough to remember it was a problem, and long
> enough to forget just what the [censored] I did to fix it.
>
> As I recall, it has little to do with actual PEs, but everything to do
> with complexes and resource requests.
>
> You might glean a bit more information by running "qsub -w p" (or "-w e").
>
> Take a look at these previous discussions:
>
> http://gridengine.org/pipermail/users/2011-November/001932.html
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
>
> --
> Jesse Becker (Contractor)

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users