Have you checked the status of the queue instances? Sometimes when a queue instance goes into an error state, it cannot run jobs like this.

qstat -F can list the status, and qmod -c <queue instance> can clear it.
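For example, something along these lines (the queue instance name just reuses all.q@ibm038 from the thread below; adjust it to whichever instance shows the error):

    # qstat -f -explain E         <--- list queue instances in error state, with the reason
    # qstat -F                    <--- dump every queue-instance attribute, including its state
    # qmod -c all.q@ibm038        <--- clear the error state (needs operator/manager rights)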
On Mon, Dec 12, 2016 at 12:35 AM, Coleman, Marcus [JRDUS Non-J&J] <mcole...@its.jnj.com> wrote:

> Hi
>
> I am sure this is your problem: you are submitting a job that requires 2
> cores to a queue that has only 1 slot available.
> If your hosts all have the same number of cores, there is no reason to
> separate them out per host. That is only needed if the hosts have
> different numbers of slots, or if you want to manipulate the slots:
>
>     slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
>     slots 8
>
> I would list only the PE I am actually requesting, unless you plan to use
> each of those PEs:
>
>     pe_list make mpi smp cores
>     pe_list cores
>
> Also, you mentioned the parallel environment. I WOULD change the
> allocation rule to $fill_up, unless your software (not SGE) controls job
> distribution:
>
>     qconf -sp core
>     allocation_rule    $pe_slots  <--- (CONFINES THE JOB TO ONE MACHINE)
>     control_slaves     FALSE      <--- (I think you want tight integration, i.e. TRUE)
>     job_is_first_task  TRUE       <--- (TRUE means the job script itself counts as one of the tasks)
>
>     allocation_rule    $fill_up   <--- works better for parallel jobs
>     control_slaves     TRUE       <--- you want tight integration with SGE
>     job_is_first_task             <--- can go either way, unless you are sure your
>                                        software controls job distribution
>
> Also, what do the qmaster messages file and the associated node's SGE
> messages file say?
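A sketch of how that change could be applied, assuming the PE is the "cores" one shown later in the thread (qconf -mp cores would open it in an editor instead):

    # qconf -sp cores > cores.pe      <--- dump the current PE definition
    (edit cores.pe:)
    allocation_rule    $fill_up
    control_slaves     TRUE
    # qconf -Mp cores.pe              <--- load the modified definition back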
----------------------------------------------------------------------

Message: 1
Date: Mon, 12 Dec 2016 05:04:33 +0000
From: John_Tai <john_...@smics.com>
To: Christopher Heiny <christopherhe...@gmail.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpi smp cores
rerun                 FALSE
slots                 1,[ibm021=8],[ibm037=8],[ibm038=8]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

From: Christopher Heiny [mailto:christopherhe...@gmail.com]
Sent: Monday, December 12, 2016 12:22
To: John_Tai
Cc: users@gridengine.org; Reuti
Subject: Re: [gridengine users] CPU complex

On Dec 11, 2016 5:11 PM, "John_Tai" <john_...@smics.com> wrote:

> I associated the queue with the PE:
>
>     qconf -aattr queue pe_list cores all.q
>
> The only slots were defined in the all.q queue definition, plus the total
> slots in the PE:
>
>     # qconf -sp cores
>     pe_name       cores
>     slots         999
>     user_lists    NONE
>     xuser_lists   NONE
>
> Do I need to define slots in another way for each exec host? Is there a
> way to check the current free slots on a host, other than the qstat -f
> below?
>
>     # qstat -f
>     queuename      qtype resv/used/tot. load_avg arch      states
>     ----------------------------------------------------------------
>     all.q@ibm021   BIP   0/0/8          0.02     lx-amd64
>     ----------------------------------------------------------------
>     all.q@ibm037   BIP   0/0/8          0.00     lx-amd64
>     ----------------------------------------------------------------
>     all.q@ibm038   BIP   0/0/8          0.00     lx-amd64

What is the output of the command

    qconf -sq all.q

? (I think that's the right one.)

Chris
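On checking free slots per host: two standard commands are worth knowing besides qstat -f (a sketch; both are plain SGE queries):

    # qstat -g c      <--- cluster-queue summary: used/available slot totals per cluster queue
    # qhost -q        <--- per-host listing, with each queue instance's slot usage underneath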
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Saturday, December 10, 2016 5:40
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

On 09.12.2016 at 10:36, John_Tai wrote:

> 8 slots:
>
> # qstat -f
> queuename      qtype resv/used/tot. load_avg arch      states
> ----------------------------------------------------------------
> all.q@ibm021   BIP   0/0/8          0.02     lx-amd64
> ----------------------------------------------------------------
> all.q@ibm037   BIP   0/0/8          0.00     lx-amd64
> ----------------------------------------------------------------
> all.q@ibm038   BIP   0/0/8          0.00     lx-amd64
> ----------------------------------------------------------------
> pc.q@ibm021    BIP   0/0/1          0.02     lx-amd64
> ----------------------------------------------------------------
> sim.q@ibm021   BIP   0/0/1          0.02     lx-amd64

Is there any slot limit defined on the exec host, or in an RQS?

-- Reuti

> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Friday, December 09, 2016 3:46
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

Hi,

On 09.12.2016 at 08:20, John_Tai wrote:

> I've set up a PE but I'm having problems submitting jobs.
>
> - Here's the PE I created:
>
> # qconf -sp cores
> pe_name            cores
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> - I've then added this to all.q:
>
> qconf -aattr queue pe_list cores all.q

How many "slots" were defined in the queue definition for all.q?

-- Reuti

> - Now I submit a job:
>
> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
> Your job 89 ("xclock") has been submitted
> # qstat
> job-ID  prior   name    user   state submit/start at      queue  slots ja-task-ID
> ----------------------------------------------------------------------------------
>     89  0.00000 xclock  johnt  qw    12/09/2016 15:14:25          2
> # qalter -w p 89
> Job 89 cannot run in PE "cores" because it only offers 0 slots
> verification: no suitable queues
> # qstat -f
> queuename      qtype resv/used/tot. load_avg arch      states
> ----------------------------------------------------------------
> all.q@ibm038   BIP   0/0/8          0.00     lx-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>
> ----------------------------------------------------
>
> It looks like all.q@ibm038 should have 8 free slots, so why is it only
> offering 0?
>
> Hope you can help me.
> Thanks
> John
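To check for the limits Reuti asks about, a sketch (standard qconf queries; the host name is the one from the thread):

    # qconf -se ibm038    <--- show the exec host; a "slots" entry under complex_values would cap it
    # qconf -srqsl        <--- list the names of any resource quota sets
    # qconf -srqs         <--- show the rules in all resource quota sets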
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Monday, December 05, 2016 6:32
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

Hi,

On 05.12.2016 at 09:36, John_Tai wrote:

> Thank you so much for your reply!
>
>> Will you use the consumable virtual_free here instead of mem?
>
> Yes, I meant to write virtual_free, not mem. Apologies.
>
>> For parallel jobs you need to configure one (or several) so-called PEs
>> (Parallel Environments).
>
> My jobs are actually just one process which uses multiple cores; for
> example, in top one process "simv" is currently using 2 CPU cores (200%).

Yes, then it's a parallel job for SGE. Although the entries for
start_proc_args resp. stop_proc_args can be left at their defaults, a PE is
the paradigm in SGE for a parallel job.

>   PID USER  PR NI VIRT  RES  SHR S %CPU  %MEM TIME+    COMMAND
>  3017 kelly 20 0  3353m 3.0g 165m R 200.0 0.6 15645:46 simv
>
> So I'm not sure a PE is suitable for my case, since it is not multiple
> parallel processes running at the same time. Am I correct?
>
> If so, I am trying to find a way to get SGE to keep track of the number
> of cores used, but I believe it only keeps track of the total CPU usage
> in %. I guess I could use this and the <total num cores> to get the
> <num of cores in use>, but how do I integrate it in SGE?

You can specify the necessary number of cores for your job with the -pe
parameter, which can also be a range. The allocation granted by SGE you can
check in the job script via $NHOSTS, $NSLOTS and $PE_HOSTFILE.

With this setup, SGE will track the number of used cores per machine. The
available ones you define in the queue definition. In case you have more
than one queue per exec host, you additionally need to set up an overall
limit of cores usable at the same time, to avoid oversubscription.

-- Reuti

> Thank you again for your help.
>
> John
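A minimal job script along those lines (a sketch: the PE name "cores" and the binary "simv" are taken from the thread, but simv's -threads flag is purely hypothetical):

    #!/bin/sh
    #$ -pe cores 2-4
    # SGE fills these in with the granted allocation:
    echo "hosts: $NHOSTS, slots: $NSLOTS"
    cat "$PE_HOSTFILE"           # one line per host: hostname, slots granted, queue, ...
    ./simv -threads "$NSLOTS"    # hypothetical flag: tell the program how many cores it may use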
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Monday, December 05, 2016 4:21
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

Hi,

On 05.12.2016 at 08:00, John_Tai wrote:

> Newbie here, hoping to understand SGE usage.
>
> I've successfully configured virtual_free as a complex for telling SGE
> how much memory is needed when submitting a job, as described here:
>
> https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000029
>
> How do I do the same for telling SGE how many CPU cores a job needs? For
> example:
>
> qsub -l mem=24G,cpu=4 myjob

Will you use the consumable virtual_free here instead of mem?

> Obviously I'd need SGE to keep track of the actual CPU utilization on the
> host, just as virtual_free is tracked independently of the SGE jobs.

For parallel jobs you need to configure one (or several) so-called PEs
(Parallel Environments). Their purpose is to make preparations for the
parallel job, like rearranging the list of granted slots or preparing
shared directories between the nodes.

These PEs were of higher importance in former times, when parallel
libraries did not integrate automatically with SGE for a tight integration.
Your submissions could read:

    qsub -pe smp 4 myjob     # allocation_rule $pe_slots, control_slaves TRUE
    qsub -pe orte 16 myjob   # allocation_rule $round_robin, control_slaves TRUE

where smp resp. orte is the chosen parallel environment for OpenMP resp.
Open MPI. The PE settings are explained in `man sge_pe`, and the "-pe"
parameter of the submission command in `man qsub`.

-- Reuti
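For reference, such an smp PE might look like this (a sketch only: the values mirror the "cores" PE shown earlier in the thread, with control_slaves switched on; it could be created with qconf -ap smp):

    pe_name            smp
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $pe_slots
    control_slaves     TRUE
    job_is_first_task  TRUE
    urgency_slots      min
    accounting_summary FALSE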
--
Best,
Feng