Dang! You are right!

The "incoherence" among jobs is due to the first core of the first socket being available. On my previous socket report, all "linear X:0,0" that were correctly reported were only the ones that could start in the first core.

I have just modified my jsv to set the policy to linear_automatic, and now it works fine!
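
The change in the JSV itself is tiny; the relevant part now looks roughly like this (a bash sketch using jsv_set_param from util/resources/jsv/jsv_include.sh, assuming the binding_* parameter names of the JSV interface; the rest of the verification logic is omitted):

# inside jsv_on_verify():
jsv_set_param binding_strategy "linear_automatic"   # request becomes "linear:1"
jsv_set_param binding_amount   "1"                  # one core, the execd picks which one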

Given these two nodes:

compute-1-8             lx26-amd64     12  6.95   94.6G   32.3G 9.8G   39.4M
   hl:m_topology_inuse=SccccccSCCCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC
binding:                    set linear:1:0,0
binding    1:               NONE

compute-1-9             lx26-amd64     12  0.01   94.6G   10.1G 9.8G   39.0M
   hl:m_topology_inuse=SCCCCCCSCCCCCC

compute-1-8 has the 1st core already bound, and compute-1-9 has it free.

I submit several single-core qlogin jobs to both nodes:

(compute-1-8)
[root@floquet ~]# qstat -j 4564595 -cb | grep binding
binding:                    set linear:1:0,0
binding    1:               NONE
(compute-1-9)
[root@floquet ~]# qstat -j 4564594 -cb | grep binding
binding:                    set linear:1:0,0
binding    1:               ScCCCCCSCCCCCC


Now I change the policy to linear_automatic and get:

(compute-1-8)
[root@floquet ~]# qstat -j 4564597 -cb | grep binding
binding:                    set linear:1
binding    1:               SCCCCCCScCCCCC
(compute-1-9)
[root@floquet ~]# qstat -j 4564596 -cb | grep binding
binding:                    set linear:1
binding    1:               ScCCCCCSCCCCCC


Thanks!!

Txema

On 27/06/14 13:19, Daniel Gruber wrote:
Hi,

Please notice the difference between "set linear:1:0,0" and
"set linear:1". The first one means: give me one core, starting
at socket 0, core 0 (which here obviously means you are
requesting core 0 on socket 0). The second one means that
you want one core on the host and the execution daemon
takes care of which one.
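
On the command line the two requests would look roughly like this with qsub (or qlogin); "job.sh" is just a placeholder script:

qsub -binding linear:1:0,0 job.sh    # exactly socket 0, core 0 - no binding if that core is already taken
qsub -binding linear:1 job.sh        # any single free core, chosen on the execution host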

So, by design, the core selection is done on the execd in SGE,
while in Univa Grid Engine we moved that to the qmaster
itself (which has many advantages due to its global
view of the cluster / job and core usage).

When the execd in your case now tries to bind the job, it figures
out that a different job already uses this core, and therefore
SGE simply doesn't do any binding for the job (in order to avoid
overallocation).

I guess your linear:1:0,0 request is not intentional - it only
makes sense in scenarios where you are using the
host exclusively for one job.

This is probably caused by your JSV script - which sets binding_strategy
to "linear" (linear:X:S,C) instead of "linear_automatic" (linear:X). Admittedly,
the naming of the JSV parameter argument is unfortunate.
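
In other words, the script presumably contains something along these lines (a bash JSV sketch; I am guessing at the exact parameter values):

jsv_set_param binding_strategy "linear"   # together with the socket/core below => "linear:1:0,0"
jsv_set_param binding_amount   "1"
jsv_set_param binding_socket   "0"        # always start at socket 0 ...
jsv_set_param binding_core     "0"        # ... core 0, even if that core is already bound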

Might this be the reason?

Cheers

Daniel


On 27.06.2014 at 12:58, Txema Heredia <[email protected]> wrote:

On 27/06/14 12:32, Reuti wrote:
On 27.06.2014 at 12:24, Txema Heredia wrote:

On 27/06/14 11:31, Reuti wrote:
Hi,

On 26.06.2014 at 17:56, Txema Heredia wrote:

<snip>

# qstat -j 4561291 -cb | grep "job_name\|binding\|queue_list"
job_name:                   c0-1
hard_queue_list:            *@compute-0-1.local
binding:                    set linear:1:0,0
binding    1:               NONE

What am I missing here? What can be different in my nodes?
Does `qhost -F` output the fields:

$ qhost -F
...
   hl:m_topology=SC
   hl:m_topology_inuse=SC
   hl:m_socket=1.000000
   hl:m_core=1.000000

for this machine?

-- Reuti
Yes, qhost -F reports that for all nodes:

# qhost -F | grep "compute\|hl:m_"
compute-0-0             lx26-amd64     12  0.60   94.6G   10.1G 9.8G   53.8M
  hl:m_topology=SCCCCCCSCCCCCC
  hl:m_topology_inuse=SCCCCCCSCCCCCC
  hl:m_socket=2.000000
  hl:m_core=12.000000
compute-0-1             lx26-amd64     12  7.21   94.6G   14.9G 9.8G   86.6M
  hl:m_topology=SCCCCCCSCCCCCC
  hl:m_topology_inuse=ScCCCCCSCCCCCC
  hl:m_socket=2.000000
  hl:m_core=12.000000
...


But the inuse topology is blatantly wrong.
What version of SGE are you using? Maybe "PLPA", which was used in former versions, doesn't support this particular CPU's topology. It was replaced by "hwloc" later on.

-- Reuti

Originally it was SGE 6.2u5, but later on I replaced the sge_qmaster binary with the one from OGS/GE 2011.11p1 (due to a problem with parallel jobs and -hold_jid).

