Have you checked the status of the queue instances? Sometimes when a queue instance goes into an error state, it cannot run jobs like this.

qstat -F can list the status, and qmod -c <queue instance> can clear it.
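For example, something along these lines (the queue instance name just reuses all.q@ibm038 from the thread below; adjust it to whichever instance shows the error):

    # qstat -f -explain E         <--- list queue instances in error state, with the reason
    # qstat -F                    <--- dump every queue-instance attribute, including its state
    # qmod -c all.q@ibm038        <--- clear the error state (needs operator/manager rights)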
On Mon, Dec 12, 2016 at 12:35 AM, Coleman, Marcus [JRDUS Non-J&J] <mcole...@its.jnj.com> wrote:

> Hi
>
> I am sure this is your problem: you are submitting a job that requires 2
> cores to a queue that has only 1 slot available.
> If your hosts all have the same number of cores, there is no reason to
> separate them out per host. That is only needed if the hosts have
> different numbers of slots, or if you want to manipulate the slots:
>
>     slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
>     slots 8
>
> I would list only the PE I am actually requesting, unless you plan to use
> each of those PEs:
>
>     pe_list make mpi smp cores
>     pe_list cores
>
> Also, you mentioned the parallel environment. I WOULD change the
> allocation rule to $fill_up, unless your software (not SGE) controls job
> distribution:
>
>     qconf -sp core
>     allocation_rule    $pe_slots  <--- (CONFINES THE JOB TO ONE MACHINE)
>     control_slaves     FALSE      <--- (I think you want tight integration, i.e. TRUE)
>     job_is_first_task  TRUE       <--- (TRUE means the job script itself counts as one of the tasks)
>
>     allocation_rule    $fill_up   <--- works better for parallel jobs
>     control_slaves     TRUE       <--- you want tight integration with SGE
>     job_is_first_task             <--- can go either way, unless you are sure your
>                                        software controls job distribution
>
> Also, what do the qmaster messages file and the associated node's SGE
> messages file say?
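A sketch of how that change could be applied, assuming the PE is the "cores" one shown later in the thread (qconf -mp cores would open it in an editor instead):

    # qconf -sp cores > cores.pe      <--- dump the current PE definition
    (edit cores.pe:)
    allocation_rule    $fill_up
    control_slaves     TRUE
    # qconf -Mp cores.pe              <--- load the modified definition back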
----------------------------------------------------------------------

Message: 1
Date: Mon, 12 Dec 2016 05:04:33 +0000
From: John_Tai <john_...@smics.com>
To: Christopher Heiny <christopherhe...@gmail.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpi smp cores
rerun                 FALSE
slots                 1,[ibm021=8],[ibm037=8],[ibm038=8]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

From: Christopher Heiny [mailto:christopherhe...@gmail.com]
Sent: Monday, December 12, 2016 12:22
To: John_Tai
Cc: users@gridengine.org; Reuti
Subject: Re: [gridengine users] CPU complex

On Dec 11, 2016 5:11 PM, "John_Tai" <john_...@smics.com> wrote:

> I associated the queue with the PE:
>
>     qconf -aattr queue pe_list cores all.q
>
> The only slots were defined in the all.q queue definition, plus the total
> slots in the PE:
>
>     # qconf -sp cores
>     pe_name       cores
>     slots         999
>     user_lists    NONE
>     xuser_lists   NONE
>
> Do I need to define slots in another way for each exec host? Is there a
> way to check the current free slots on a host, other than the qstat -f
> below?
>
>     # qstat -f
>     queuename      qtype resv/used/tot. load_avg arch      states
>     ----------------------------------------------------------------
>     all.q@ibm021   BIP   0/0/8          0.02     lx-amd64
>     ----------------------------------------------------------------
>     all.q@ibm037   BIP   0/0/8          0.00     lx-amd64
>     ----------------------------------------------------------------
>     all.q@ibm038   BIP   0/0/8          0.00     lx-amd64

What is the output of the command

    qconf -sq all.q

? (I think that's the right one.)

Chris
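On checking free slots per host: two standard commands are worth knowing besides qstat -f (a sketch; both are plain SGE queries):

    # qstat -g c      <--- cluster-queue summary: used/available slot totals per cluster queue
    # qhost -q        <--- per-host listing, with each queue instance's slot usage underneath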
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Saturday, December 10, 2016 5:40
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

On 09.12.2016 at 10:36, John_Tai wrote:

> 8 slots:
>
> # qstat -f
> queuename      qtype resv/used/tot. load_avg arch      states
> ----------------------------------------------------------------
> all.q@ibm021   BIP   0/0/8          0.02     lx-amd64
> ----------------------------------------------------------------
> all.q@ibm037   BIP   0/0/8          0.00     lx-amd64
> ----------------------------------------------------------------
> all.q@ibm038   BIP   0/0/8          0.00     lx-amd64
> ----------------------------------------------------------------
> pc.q@ibm021    BIP   0/0/1          0.02     lx-amd64
> ----------------------------------------------------------------
> sim.q@ibm021   BIP   0/0/1          0.02     lx-amd64

Is there any slot limit defined on the exec host, or in an RQS?

-- Reuti

> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Friday, December 09, 2016 3:46
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

Hi,

On 09.12.2016 at 08:20, John_Tai wrote:

> I've set up a PE but I'm having problems submitting jobs.
>
> - Here's the PE I created:
>
> # qconf -sp cores
> pe_name            cores
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> - I've then added this to all.q:
>
> qconf -aattr queue pe_list cores all.q

How many "slots" were defined in the queue definition for all.q?

-- Reuti

> - Now I submit a job:
>
> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
> Your job 89 ("xclock") has been submitted
> # qstat
> job-ID  prior   name    user   state submit/start at      queue  slots ja-task-ID
> ----------------------------------------------------------------------------------
>     89  0.00000 xclock  johnt  qw    12/09/2016 15:14:25          2
> # qalter -w p 89
> Job 89 cannot run in PE "cores" because it only offers 0 slots
> verification: no suitable queues
> # qstat -f
> queuename      qtype resv/used/tot. load_avg arch      states
> ----------------------------------------------------------------
> all.q@ibm038   BIP   0/0/8          0.00     lx-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>
> ----------------------------------------------------
>
> It looks like all.q@ibm038 should have 8 free slots, so why is it only
> offering 0?
>
> Hope you can help me.
> Thanks
> John
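To check for the limits Reuti asks about, a sketch (standard qconf queries; the host name is the one from the thread):

    # qconf -se ibm038    <--- show the exec host; a "slots" entry under complex_values would cap it
    # qconf -srqsl        <--- list the names of any resource quota sets
    # qconf -srqs         <--- show the rules in all resource quota sets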
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Monday, December 05, 2016 6:32
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

Hi,

On 05.12.2016 at 09:36, John_Tai wrote:

> Thank you so much for your reply!
>
>> Will you use the consumable virtual_free here instead of mem?
>
> Yes, I meant to write virtual_free, not mem. Apologies.
>
>> For parallel jobs you need to configure one (or several) so-called PEs
>> (Parallel Environments).
>
> My jobs are actually just one process which uses multiple cores; for
> example, in top one process "simv" is currently using 2 CPU cores (200%).

Yes, then it's a parallel job for SGE. Although the entries for
start_proc_args resp. stop_proc_args can be left at their defaults, a PE is
the paradigm in SGE for a parallel job.

>   PID USER  PR NI VIRT  RES  SHR S %CPU  %MEM TIME+    COMMAND
>  3017 kelly 20 0  3353m 3.0g 165m R 200.0 0.6 15645:46 simv
>
> So I'm not sure a PE is suitable for my case, since it is not multiple
> parallel processes running at the same time. Am I correct?
>
> If so, I am trying to find a way to get SGE to keep track of the number
> of cores used, but I believe it only keeps track of the total CPU usage
> in %. I guess I could use this and the <total num cores> to get the
> <num of cores in use>, but how do I integrate it in SGE?

You can specify the necessary number of cores for your job with the -pe
parameter, which can also be a range. The allocation granted by SGE you can
check in the job script via $NHOSTS, $NSLOTS and $PE_HOSTFILE.

With this setup, SGE will track the number of used cores per machine. The
available ones you define in the queue definition. In case you have more
than one queue per exec host, you additionally need to set up an overall
limit of cores usable at the same time, to avoid oversubscription.

-- Reuti

> Thank you again for your help.
>
> John
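A minimal job script along those lines (a sketch: the PE name "cores" and the binary "simv" are taken from the thread, but simv's -threads flag is purely hypothetical):

    #!/bin/sh
    #$ -pe cores 2-4
    # SGE fills these in with the granted allocation:
    echo "hosts: $NHOSTS, slots: $NSLOTS"
    cat "$PE_HOSTFILE"           # one line per host: hostname, slots granted, queue, ...
    ./simv -threads "$NSLOTS"    # hypothetical flag: tell the program how many cores it may use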
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Monday, December 05, 2016 4:21
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] CPU complex

Hi,

On 05.12.2016 at 08:00, John_Tai wrote:

> Newbie here, hoping to understand SGE usage.
>
> I've successfully configured virtual_free as a complex for telling SGE
> how much memory is needed when submitting a job, as described here:
>
> https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000029
>
> How do I do the same for telling SGE how many CPU cores a job needs? For
> example:
>
> qsub -l mem=24G,cpu=4 myjob

Will you use the consumable virtual_free here instead of mem?

> Obviously I'd need SGE to keep track of the actual CPU utilization on the
> host, just as virtual_free is tracked independently of the SGE jobs.

For parallel jobs you need to configure one (or several) so-called PEs
(Parallel Environments). Their purpose is to make preparations for the
parallel job, like rearranging the list of granted slots or preparing
shared directories between the nodes.

These PEs were of higher importance in former times, when parallel
libraries did not integrate automatically with SGE for a tight integration.
Your submissions could read:

    qsub -pe smp 4 myjob     # allocation_rule $pe_slots, control_slaves TRUE
    qsub -pe orte 16 myjob   # allocation_rule $round_robin, control_slaves TRUE

where smp resp. orte is the chosen parallel environment for OpenMP resp.
Open MPI. The PE settings are explained in `man sge_pe`, and the "-pe"
parameter of the submission command in `man qsub`.

-- Reuti
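For reference, such an smp PE might look like this (a sketch only: the values mirror the "cores" PE shown earlier in the thread, with control_slaves switched on; it could be created with qconf -ap smp):

    pe_name            smp
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $pe_slots
    control_slaves     TRUE
    job_is_first_task  TRUE
    urgency_slots      min
    accounting_summary FALSE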
--
Best,
Feng