Hmmm... no luck tracking this down yet. If anything, based on what I saw in
the code, I would have expected it to fail when hetero-nodes was false, not
the other way around.

I'll keep poking around - just wanted to provide an update.

On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph, sorry for cutting into this thread.
> 
> Your advice about -hetero-nodes in the other thread gave me a hint.
> 
> I had already put "orte_hetero_nodes = 1" in my mca-params.conf, because
> you told me a month ago that my environment would need this option.
> 
> Removing this line from mca-params.conf makes it work. In other words,
> you can replicate the failure by adding -hetero-nodes, as shown below.
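[For reference, the two equivalent ways of enabling the hetero-nodes behavior discussed in this thread can be sketched as follows; the mca-params.conf path is an assumption, since its location varies by installation.]

```
# Persistent: in the MCA parameter file read by Open MPI,
# e.g. $HOME/.openmpi/mca-params.conf (path is an assumption)
orte_hetero_nodes = 1

# Per run: equivalent command-line forms
#   mpirun -hetero-nodes ...
#   mpirun --mca orte_hetero_nodes 1 ...
```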
> 
> qsub: job 8364.manage.cluster completed
> [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
> qsub: waiting for job 8365.manage.cluster to start
> qsub: job 8365.manage.cluster ready
> 
> [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
>                MCA orte: parameter "orte_hetero_nodes" (current value:
> "false", data source: default, level: 9 dev/all,
> type: bool)
> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> myprog
> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 0 of 4
> Hello world from process 1 of 4
> Hello world from process 2 of 4
> Hello world from process 3 of 4
> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> -hetero-nodes myprog
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to:         CORE
>   Node:            node12
>   #processes:  2
>   #cpus:          1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
> 
> 
> As far as I checked, data->num_bound seems to go bad in bind_downwards
> when I add "-hetero-nodes". I hope you can pin down the problem.
> 
> Regards,
> Tetsuya Mishima
> 
> 
>> Yes, it's very strange. But I don't think there's any chance that
>> I have < 8 actual cores on the node. I guess you can replicate
>> it with SLURM, so please try it again.
>> 
>> I changed to use node10 and node11, then I got the warning against
>> node11.
>> 
>> Furthermore, just as information for you, I tried adding
>> "-bind-to core:overload-allowed", and then it worked as shown below.
>> But node11 should never be overloaded, because it has 8 cores.
>> 
>> qsub: job 8342.manage.cluster completed
>> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
>> qsub: waiting for job 8343.manage.cluster to start
>> qsub: job 8343.manage.cluster ready
>> 
>> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>> [mishima@node10 demos]$ cat $PBS_NODEFILE
>> node10
>> node10
>> node10
>> node10
>> node10
>> node10
>> node10
>> node10
>> node11
>> node11
>> node11
>> node11
>> node11
>> node11
>> node11
>> node11
>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> myprog
>> 
> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>> 
>> Bind to:         CORE
>> Node:            node11
>> #processes:  2
>> #cpus:          1
>> 
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> 
> --------------------------------------------------------------------------
>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> -bind-to core:overload-allowed myprog
>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> Hello world from process 1 of 4
>> Hello world from process 0 of 4
>> Hello world from process 3 of 4
>> Hello world from process 2 of 4
>> 
>> Regards,
>> Tetsuya Mishima
>> 
>> 
>>> Very strange - I can't seem to replicate it. Is there any chance that
>>> you have < 8 actual cores on node12?
>>> 
>>> 
>>> On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
>>> 
>>>> 
>>>> 
>>>> Hi Ralph, sorry for the confusion.
>>>> 
>>>> At that time, I cut and pasted part of the "cat $PBS_NODEFILE" output.
>>>> I guess I dropped the last line by mistake.
>>>> 
>>>> I retried the test, and below is exactly what I got when I ran it.
>>>> 
>>>> [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
>>>> qsub: waiting for job 8338.manage.cluster to start
>>>> qsub: job 8338.manage.cluster ready
>>>> 
>>>> [mishima@node11 ~]$ cat $PBS_NODEFILE
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node12
>>>> node12
>>>> node12
>>>> node12
>>>> node12
>>>> node12
>>>> node12
>>>> node12
>>>> [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> myprog
>>>> 
>> 
> --------------------------------------------------------------------------
>>>> A request was made to bind to that would result in binding more
>>>> processes than cpus on a resource:
>>>> 
>>>>  Bind to:         CORE
>>>>  Node:            node12
>>>>  #processes:  2
>>>>  #cpus:          1
>>>> 
>>>> You can override this protection by adding the "overload-allowed"
>>>> option to your binding directive.
>>>> 
>> 
> --------------------------------------------------------------------------
>>>> 
>>>> Regards,
>>>> 
>>>> Tetsuya Mishima
>>>> 
>>>>> I removed the debug in #2 - thanks for reporting it
>>>>> 
>>>>> For #1, it actually looks to me like this is correct. If you look at
>>>>> your allocation, there are only 7 slots being allocated on node12, yet
>>>>> you have asked for 8 cpus to be assigned there (2 procs with 4
>>>>> cpus/proc). So the warning is in fact correct
>>>>> 
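[The slot arithmetic described here can be sketched as a quick check. The numbers come from the transcript: node12 contributed only 7 slots because one $PBS_NODEFILE line was missing, while "-np 4 -cpus-per-proc 4" places 2 procs needing 4 cpus each on that node. The script is an illustrative sketch, not Open MPI's actual mapper code.]

```shell
# Hypothetical sketch of the overload check behind the warning:
# node12 contributed only 7 slots, but the mapper placed
# 2 procs x 4 cpus/proc = 8 cpus there.
slots=7
procs=2
cpus_per_proc=4
needed=$((procs * cpus_per_proc))
if [ "$needed" -gt "$slots" ]; then
  echo "overload: need $needed cpus, have $slots slots"
fi
# prints: overload: need 8 cpus, have 7 slots
```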
>>>>> 
>>>>> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd
>>>>>> like to report 3 issues, mainly regarding -cpus-per-proc.
>>>>>> 
>>>>>> 1) When I use 2 nodes (node11, node12), each of which has 8 cores
>>>>>> (= 2 sockets x 4 cores/socket), it starts to produce the error again
>>>>>> as shown below. At least openmpi-1.7.4a1r29646 did work well.
>>>>>> 
>>>>>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
>>>>>> qsub: waiting for job 8336.manage.cluster to start
>>>>>> qsub: job 8336.manage.cluster ready
>>>>>> 
>>>>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>> [mishima@node11 demos]$ cat $PBS_NODEFILE
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4
>> -report-bindings
>>>>>> myprog
>>>>>> 
>>>> 
>> 
> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>> 
>>>>>> Bind to:         CORE
>>>>>> Node:            node12
>>>>>> #processes:  2
>>>>>> #cpus:          1
>>>>>> 
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> 
>>>> 
>> 
> --------------------------------------------------------------------------
>>>>>> 
>>>>>> Of course it works well using only one node.
>>>>>> 
>>>>>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4
>> -report-bindings
>>>>>> myprog
>>>>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>> Hello world from process 1 of 2
>>>>>> Hello world from process 0 of 2
>>>>>> 
>>>>>> 
>>>>>> 2) Adding "-bind-to numa", it works, but the message "bind:upward
>>>>>> target NUMANode type NUMANode" appears. As far as I remember, I
>>>>>> didn't see this kind of message before.
>>>>>> 
>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>>>>>> -bind-to numa myprog
>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>> Hello world from process 1 of 4
>>>>>> Hello world from process 0 of 4
>>>>>> Hello world from process 3 of 4
>>>>>> Hello world from process 2 of 4
>>>>>> 
>>>>>> 
>>>>>> 3) I use the PGI compiler. It cannot accept the compiler switch
>>>>>> "-Wno-variadic-macros", which is included in the configure script:
>>>>>> 
>>>>>>  btl_usnic_CFLAGS="-Wno-variadic-macros"
>>>>>> 
>>>>>> I removed this switch, and then I could continue building 1.7.4rc1.
>>>>>> 
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>> 
>>>>>> 
>>>>>>> Hmmm...okay, I understand the scenario. Must be something in the
>>>>>>> algo when it only has one node, so it shouldn't be too hard to
>>>>>>> track down.
>>>>>>> 
>>>>>>> I'm off on travel for a few days, but will return to this when I
>>>>>>> get back.
>>>>>>> 
>>>>>>> Sorry for the delay - will try to look at this while I'm gone, but
>>>>>>> can't promise anything :-(
>>>>>>> 
>>>>>>> 
>>>>>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Ralph, sorry for the confusion.
>>>>>>>> 
>>>>>>>> We usually log on to "manage", which is our control node.
>>>>>>>> From manage, we submit jobs or enter a remote node such as
>>>>>>>> node03 via Torque interactive mode (qsub -I).
>>>>>>>> 
>>>>>>>> That time, instead of Torque, I just rsh'd to node03 from manage
>>>>>>>> and ran myprog on the node. I hope that makes clear what I did.
>>>>>>>> 
>>>>>>>> Now, I have retried with "-host node03", which still causes the
>>>>>>>> problem (I confirmed that a local run on manage caused the same
>>>>>>>> problem too):
>>>>>>>> 
>>>>>>>> [mishima@manage ~]$ rsh node03
>>>>>>>> Last login: Wed Dec 11 11:38:57 from manage
>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>> [mishima@node03 demos]$
>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03
> -report-bindings
>>>>>>>> -cpus-per-proc 4 -map-by socket myprog
>>>>>>>> 
>>>>>> 
>>>> 
>> 
> --------------------------------------------------------------------------
>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>> processes than cpus on a resource:
>>>>>>>> 
>>>>>>>> Bind to:         CORE
>>>>>>>> Node:            node03
>>>>>>>> #processes:  2
>>>>>>>> #cpus:          1
>>>>>>>> 
>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>> option to your binding directive.
>>>>>>>> 
>>>>>> 
>>>> 
>> 
> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> It's strange, but I have to report that "-map-by socket:span"
>>>>>>>> worked well.
>>>>>>>> 
>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03
> -report-bindings
>>>>>>>> -cpus-per-proc 4 -map-by socket:span myprog
>>>>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>> Hello world from process 2 of 8
>>>>>>>> Hello world from process 6 of 8
>>>>>>>> Hello world from process 3 of 8
>>>>>>>> Hello world from process 7 of 8
>>>>>>>> Hello world from process 1 of 8
>>>>>>>> Hello world from process 5 of 8
>>>>>>>> Hello world from process 0 of 8
>>>>>>>> Hello world from process 4 of 8
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi Ralph,
>>>>>>>>>> 
>>>>>>>>>> I tried again with -cpus-per-proc 2 as shown below.
>>>>>>>>>> Here, I found that "-map-by socket:span" worked well.
>>>>>>>>>> 
>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
>>>> -cpus-per-proc
>>>>>> 2
>>>>>>>>>> -map-by socket:span myprog
>>>>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
>>>>>>>>>> /./././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 2[core 17[hwt 0]]: [./././././././.][./././.
>>>>>>>>>> /./././.][B/B/./././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 2[core 19[hwt 0]]: [./././././././.][./././.
>>>>>>>>>> /./././.][././B/B/./././.][./././././././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 3[core 25[hwt 0]]: [./././././././.][./././.
>>>>>>>>>> /./././.][./././././././.][B/B/./././././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 3[core 27[hwt 0]]: [./././././././.][./././.
>>>>>>>>>> /./././.][./././././././.][././B/B/./././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
>>>> -cpus-per-proc
>>>>>> 2
>>>>>>>>>> -map-by socket myprog
>>>>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 7[hwt 0]]: [././././././B/B][././././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
>>>>>>>>>> /./././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 13[hwt 0]]: [./././././././.][./././.
>>>>>>>>>> /B/B/./.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 15[hwt 0]]: [./././././././.][./././.
>>>>>>>>>> /././B/B][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
>>>>>>>>>> /././.][./././././././.][./././././././.]
>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>> 
>>>>>>>>>> "-np 8" and "-cpus-per-proc 4" just fill all the sockets. In
>>>>>>>>>> this case, I guess "-map-by socket:span" and "-map-by socket"
>>>>>>>>>> have the same meaning. Therefore, there's no problem with that.
>>>>>>>>>> Sorry for the noise.
>>>>>>>>> 
>>>>>>>>> No problem - glad you could clear that up :-)
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> By the way, through this test, I found another problem.
>>>>>>>>>> Without the Torque manager, just using rsh, it causes the same
>>>>>>>>>> error, as shown below:
>>>>>>>>>> 
>>>>>>>>>> [mishima@manage openmpi-1.7]$ rsh node03
>>>>>>>>>> Last login: Wed Dec 11 09:42:02 from manage
>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
>>>> -cpus-per-proc
>>>>>> 4
>>>>>>>>>> -map-by socket myprog
>>>>>>>>> 
>>>>>>>>> I don't understand the difference here - you are simply starting
>>>>>>>>> it from a different node? It looks like everything is expected to
>>>>>>>>> run local to mpirun, yes? So there is no rsh actually involved
>>>>>>>>> here. Are you still running in an allocation?
>>>>>>>>> 
>>>>>>>>> If you run this with "-host node03" on the cmd line, do you see
>> the
>>>>>> same
>>>>>>>> problem?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>> 
> --------------------------------------------------------------------------
>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>> 
>>>>>>>>>> Bind to:         CORE
>>>>>>>>>> Node:            node03
>>>>>>>>>> #processes:  2
>>>>>>>>>> #cpus:          1
>>>>>>>>>> 
>>>>>>>>>> You can override this protection by adding the
> "overload-allowed"
>>>>>>>>>> option to your binding directive.
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>> 
> --------------------------------------------------------------------------
>>>>>>>>>> [mishima@node03 demos]$
>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
>>>> -cpus-per-proc
>>>>>> 4
>>>>>>>>>> myprog
>>>>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>>>>>>>>> ocket 1[core 11[hwt 0]]:
>>>>>>>>>> 
>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>>>>>>>>> socket 1[core 15[hwt 0]]:
>>>>>>>>>> 
>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>>>>>>>>> socket 2[core 19[hwt 0]]:
>>>>>>>>>> 
>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>>>>>>>>> socket 2[core 23[hwt 0]]:
>>>>>>>>>> 
>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>>>>>>>>> socket 3[core 27[hwt 0]]:
>>>>>>>>>> 
>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt
>>>> 0]],
>>>>>>>> socket
>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>>>>>>>>> socket 3[core 31[hwt 0]]:
>>>>>>>>>> 
>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>>>>>>>>> cket 0[core 3[hwt 0]]:
>>>>>>>>>> 
>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt
>> 0]],
>>>>>>>> socket
>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>>>>>>>>> cket 0[core 7[hwt 0]]:
>>>>>>>>>> 
>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>> 
>>>>>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but
>>>>>>>>>>> let me poke around a bit and see what might be happening.
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
>>>>>>>>>>>> 
>>>>>>>>>>>> But it still causes the problem; it seems socket:span doesn't
>>>>>>>>>>>> work.
>>>>>>>>>>>> 
>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
>>>>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
>>>>>>>>>>>> qsub: job 8265.manage.cluster ready
>>>>>>>>>>>> 
>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
>>>>>> -cpus-per-proc
>>>>>>>> 4
>>>>>>>>>>>> -map-by socket:span myprog
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8
> [hwt
>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>>>>>>>>>>> ocket 1[core 11[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12
> [hwt
>>>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>>>>>>>>>>> socket 1[core 15[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16
> [hwt
>>>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>>>>>>>>>>> socket 2[core 19[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20
> [hwt
>>>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>>>>>>>>>>> socket 2[core 23[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24
> [hwt
>>>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>>>>>>>>>>> socket 3[core 27[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28
> [hwt
>>>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>>>>>>>>>>> socket 3[core 31[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0
> [hwt
>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>>>>>>>>>>> cket 0[core 3[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4
> [hwt
>>>> 0]],
>>>>>>>>>> socket
>>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>>>>>>>>>>> cket 0[core 7[hwt 0]]:
>>>>>>>>>>>> 
>>>>>> 
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>> 
>>>>>>>>>>> No, that is actually correct. We map a socket until full, then
>>>>>>>>>>> move to the next. What you want is --map-by socket:span
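[The fill-versus-span behavior discussed here can be illustrated with a small arithmetic sketch. It assumes the 4-socket, 8-cores-per-socket node03 from the transcript with "-np 8 -cpus-per-proc 2"; the formulas are illustrative, not Open MPI's actual mapper code.]

```shell
# Hypothetical sketch: with -np 8 -cpus-per-proc 2 on 4 sockets of 8 cores,
# "-map-by socket" fills each socket (4 procs fit) before moving on, while
# "-map-by socket:span" balances the ranks across all sockets.
np=8; sockets=4
fit_per_socket=4                      # 8 cores / 2 cpus-per-proc
per_socket_span=$((np / sockets))     # even spread: 2 ranks per socket
fill=""; span=""
for r in $(seq 0 $((np - 1))); do
  fill="$fill$((r / fit_per_socket))"   # socket chosen by fill mapping
  span="$span$((r / per_socket_span))"  # socket chosen by span mapping
done
echo "map-by socket:      $fill"        # 00001111 - sockets 2,3 left idle
echo "map-by socket:span: $span"        # 00112233 - all sockets used
```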
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp
> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I had time to try your patch yesterday, using
>>>>>>>>>>>>>> openmpi-1.7.4a1r29646. It stopped the error, but
>>>>>>>>>>>>>> unfortunately "mapping by socket" itself didn't work well,
>>>>>>>>>>>>>> as shown below:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
>>>>>>>>>>>>>> qsub: job 8260.manage.cluster ready
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>>>>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>>>>>>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>>>>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>>>>>>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>>>>>>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think this should be like this:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> rank 00  [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> rank 01  [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> rank 02  [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Tetsuya Mishima
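[Editor's note: a small illustrative sketch, not Open MPI code. It encodes the round-robin placement Tetsuya expects from `-np 8 -cpus-per-proc 4 -map-by socket`; the node geometry (4 sockets of 8 cores, taken from the binding maps above) is an assumption.]

```python
# Hypothetical sketch of the EXPECTED placement: -map-by socket should
# round-robin ranks across sockets, and each rank should bind to the
# next free block of CPUS_PER_PROC cores on its socket.
NSOCKETS, CORES_PER_SOCKET, CPUS_PER_PROC = 4, 8, 4

def expected_binding(rank):
    sock = rank % NSOCKETS                        # round-robin over sockets
    offset = (rank // NSOCKETS) * CPUS_PER_PROC   # next free core block on that socket
    masks = []
    for s in range(NSOCKETS):
        cells = []
        for c in range(CORES_PER_SOCKET):
            bound = (s == sock) and (offset <= c < offset + CPUS_PER_PROC)
            cells.append("B" if bound else ".")
        masks.append("[" + "/".join(cells) + "]")
    return "".join(masks)

for r in range(3):
    print("rank %02d" % r, expected_binding(r))
```

Under these assumptions rank 0 lands on socket 0, rank 1 on socket 1, and so on, matching the "rank 00/01/02" maps quoted above rather than the output actually observed.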
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I fixed this under the trunk (it was an issue regardless of RM) and
>>>>>>>>>>>>>>> have scheduled it for 1.7.4.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thank you very much for your quick response.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm afraid to say that I found one more issue...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's not so serious, so please check it when you have some spare time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under the Torque
>>>>>>>>>>>>>>>> manager. It doesn't work as shown below. I guess you would see the
>>>>>>>>>>>>>>>> same behaviour under the Slurm manager.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    Bind to:     CORE
>>>>>>>>>>>>>>>>    Node:        node03
>>>>>>>>>>>>>>>>    #processes:  2
>>>>>>>>>>>>>>>>    #cpus:       1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>>>>>>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>>>>>>>>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>>>>>>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>>>>>>>>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>>>>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>>>>>>>>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'll update tomorrow.
>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This is the continuation of "Segmentation fault in oob_tcp.c of
>>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646".
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I found the cause.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Firstly, I noticed that your hostfile works and mine does not.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Your host file:
>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>> bend001 slots=12
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> My host file:
>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my hostfile
>>>>>>>>>>>>>>>>> just before launching mpirun. Then it worked.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> My host file(modified):
>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
>>>>>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>>>>>>>>>>> 394,401c394,399
>>>>>>>>>>>>>>>>> <     if (got_count) {
>>>>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>>>>> <     } else if (got_max) {
>>>>>>>>>>>>>>>>> <         node->slots = node->slots_max;
>>>>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>>>>> <     } else {
>>>>>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>> <         node->slots_given = false;
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>> >     if (!got_count) {
>>>>>>>>>>>>>>>>> >         if (got_max) {
>>>>>>>>>>>>>>>>> >             node->slots = node->slots_max;
>>>>>>>>>>>>>>>>> >         } else {
>>>>>>>>>>>>>>>>> >             ++node->slots;
>>>>>>>>>>>>>>>>> >         }
>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Finally, I added line 402 below just as a tentative trial.
>>>>>>>>>>>>>>>>> Then, it worked.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> 394      if (got_count) {
>>>>>>>>>>>>>>>>> 395          node->slots_given = true;
>>>>>>>>>>>>>>>>> 396      } else if (got_max) {
>>>>>>>>>>>>>>>>> 397          node->slots = node->slots_max;
>>>>>>>>>>>>>>>>> 398          node->slots_given = true;
>>>>>>>>>>>>>>>>> 399      } else {
>>>>>>>>>>>>>>>>> 400          /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>> 401          node->slots_given = false;
>>>>>>>>>>>>>>>>> 402          ++node->slots; /* added by tmishima */
>>>>>>>>>>>>>>>>> 403      }
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Please fix the problem properly, because my change is just based on a
>>>>>>>>>>>>>>>>> random guess. It's related to the treatment of a hostfile where slots
>>>>>>>>>>>>>>>>> information is not given.
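[Editor's note: a simplified sketch, not the actual orte hostfile parser, illustrating the slot-counting behaviour Tetsuya's tentative `++node->slots` restores: each bare repetition of a hostname should add one slot, while an explicit `slots=N` clause sets the count. The parsing details here are assumptions for illustration only.]

```python
# Hypothetical model of hostfile slot counting.  A bare hostname line
# increments the host's slot count (mirrors "++node->slots"); an explicit
# "slots=N" token sets it directly (mirrors the got_count branch).
def count_slots(hostfile_lines):
    slots = {}
    for line in hostfile_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        explicit = [f for f in fields[1:] if f.startswith("slots=")]
        if explicit:
            slots[host] = int(explicit[0].split("=", 1)[1])  # explicit count wins
        else:
            slots[host] = slots.get(host, 0) + 1             # bare line: one more slot
    return slots

print(count_slots(["node08"] * 8))  # eight bare lines -> eight slots
```

This is why the unmodified hostfile (eight bare `node08` lines) stopped yielding any slots once the increment was dropped, while adding `slots=1` to each line worked around it.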
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
