Hmmm...not having any luck tracking this down yet. If anything, based on what I saw in the code, I would have expected it to fail when hetero-nodes was false, not the other way around.
I'll keep poking around - just wanted to provide an update. On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, sorry for interrupting the thread. > > Your advice about -hetero-nodes in the other thread gave me a hint. > > I had already put "orte_hetero_nodes = 1" in my mca-params.conf, because > you told me a month ago that my environment would need this option. > > Removing this line from mca-params.conf makes it work. > In other words, you can replicate the failure by adding -hetero-nodes as > shown below. > > qsub: job 8364.manage.cluster completed > [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8 > qsub: waiting for job 8365.manage.cluster to start > qsub: job 8365.manage.cluster ready > > [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes > MCA orte: parameter "orte_hetero_nodes" (current value: "false", data source: default, level: 9 dev/all, type: bool) > [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog > [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.] > [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B] > [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B] > [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.] 
> Hello world from process 0 of 4 > Hello world from process 1 of 4 > Hello world from process 2 of 4 > Hello world from process 3 of 4 > [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -hetero-nodes myprog > -------------------------------------------------------------------------- > A request was made to bind to that would result in binding more > processes than cpus on a resource: > > Bind to: CORE > Node: node12 > #processes: 2 > #cpus: 1 > > You can override this protection by adding the "overload-allowed" > option to your binding directive. > -------------------------------------------------------------------------- > > > As far as I checked, data->num_bound seems to become bad in bind_downwards > when I put "-hetero-nodes". I hope you can clear up the problem. > > Regards, > Tetsuya Mishima > > >> Yes, it's very strange. But I don't think there's any chance that >> I have < 8 actual cores on the node. I guess you can replicate >> it with SLURM, so please try it again. >> >> I changed to use node10 and node11, then I got the warning against >> node11. >> >> Furthermore, just as information for you, I tried to add >> "-bind-to core:overload-allowed", and then it worked as shown below. >> But I think node11 is never overloaded because it has 8 cores. 
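For readers following along: the MCA parameter under discussion lives in the per-user params file (the default path $HOME/.openmpi/mca-params.conf). The reporter's configuration amounts to the single line below, and the workaround is simply removing or commenting it out:

```conf
# $HOME/.openmpi/mca-params.conf  (per-user MCA parameters file)
# Treat the nodes of the allocation as potentially heterogeneous.
# Commenting this out avoids the "binding more processes than cpus"
# error reported against 1.7.4rc1 above.
orte_hetero_nodes = 1
```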
>> >> qsub: job 8342.manage.cluster completed >> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8 >> qsub: waiting for job 8343.manage.cluster to start >> qsub: job 8343.manage.cluster ready >> >> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >> [mishima@node10 demos]$ cat $PBS_NODEFILE >> node10 >> node10 >> node10 >> node10 >> node10 >> node10 >> node10 >> node10 >> node11 >> node11 >> node11 >> node11 >> node11 >> node11 >> node11 >> node11 >> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings >> myprog >> > -------------------------------------------------------------------------- >> A request was made to bind to that would result in binding more >> processes than cpus on a resource: >> >> Bind to: CORE >> Node: node11 >> #processes: 2 >> #cpus: 1 >> >> You can override this protection by adding the "overload-allowed" >> option to your binding directive. >> > -------------------------------------------------------------------------- >> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings >> -bind-to core:overload-allowed myprog >> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], > socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] >> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], > socket >> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so >> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] >> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], > socket >> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so >> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] >> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], > socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] 
>> Hello world from process 1 of 4 >> Hello world from process 0 of 4 >> Hello world from process 3 of 4 >> Hello world from process 2 of 4 >> >> Regards, >> Tetsuya Mishima >> >> >>> Very strange - I can't seem to replicate it. Is there any chance that you have < 8 actual cores on node12? >>> >>> >>> On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote: >>> >>>> >>>> >>>> Hi Ralph, sorry for the confusion. >>>> >>>> At that time, I cut and pasted the part with "cat $PBS_NODEFILE". >>>> I guess I failed to paste the last line by mistake. >>>> >>>> I retried the test, and the output below is exactly what I got. >>>> >>>> [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8 >>>> qsub: waiting for job 8338.manage.cluster to start >>>> qsub: job 8338.manage.cluster ready >>>> >>>> [mishima@node11 ~]$ cat $PBS_NODEFILE >>>> node11 >>>> node11 >>>> node11 >>>> node11 >>>> node11 >>>> node11 >>>> node11 >>>> node11 >>>> node12 >>>> node12 >>>> node12 >>>> node12 >>>> node12 >>>> node12 >>>> node12 >>>> node12 >>>> [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog >>>> -------------------------------------------------------------------------- >>>> A request was made to bind to that would result in binding more >>>> processes than cpus on a resource: >>>> >>>> Bind to: CORE >>>> Node: node12 >>>> #processes: 2 >>>> #cpus: 1 >>>> >>>> You can override this protection by adding the "overload-allowed" >>>> option to your binding directive. >>>> -------------------------------------------------------------------------- >>>> >>>> Regards, >>>> >>>> Tetsuya Mishima >>>> >>>>> I removed the debug in #2 - thanks for reporting it >>>>> >>>>> For #1, it actually looks to me like this is correct. If you look at your >>>> allocation, there are only 7 slots being allocated on node12, yet you have >>>> asked for 8 cpus to be assigned (2 procs with 4 >>>>> cpus/proc). 
So the warning is in fact correct. >>>>> >>>>> >>>>> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote: >>>>> >>>>>> >>>>>> >>>>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd like >>>>>> to report >>>>>> 3 issues, mainly regarding -cpus-per-proc. >>>>>> >>>>>> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2 sockets x >>>>>> 4 cores/socket), >>>>>> it starts to produce the error again as shown below. At least >>>>>> openmpi-1.7.4a1r29646 did >>>>>> work well. >>>>>> >>>>>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8 >>>>>> qsub: waiting for job 8336.manage.cluster to start >>>>>> qsub: job 8336.manage.cluster ready >>>>>> >>>>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>> [mishima@node11 demos]$ cat $PBS_NODEFILE >>>>>> node11 >>>>>> node11 >>>>>> node11 >>>>>> node11 >>>>>> node11 >>>>>> node11 >>>>>> node11 >>>>>> node11 >>>>>> node12 >>>>>> node12 >>>>>> node12 >>>>>> node12 >>>>>> node12 >>>>>> node12 >>>>>> node12 >>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings >>>>>> myprog >>>>>> -------------------------------------------------------------------------- >>>>>> A request was made to bind to that would result in binding more >>>>>> processes than cpus on a resource: >>>>>> >>>>>> Bind to: CORE >>>>>> Node: node12 >>>>>> #processes: 2 >>>>>> #cpus: 1 >>>>>> >>>>>> You can override this protection by adding the "overload-allowed" >>>>>> option to your binding directive. >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> Of course it works well using only one node. >>>>>> >>>>>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings >>>>>> myprog >>>>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.] 
>>>>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], >>>> socket >>>>>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so >>>>>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] >>>>>> Hello world from process 1 of 2 >>>>>> Hello world from process 0 of 2 >>>>>> >>>>>> >>>>>> 2) Adding "-bind-to numa", it works but the message "bind:upward >> target >>>>>> NUMANode type NUMANode" appears. >>>>>> As far as I remember, I didn't see such a kind of message before. >>>>>> >>>>>> mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 > -report-bindings >>>>>> -bind-to numa myprog >>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode > type >>>>>> NUMANode >>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode > type >>>>>> NUMANode >>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode > type >>>>>> NUMANode >>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode > type >>>>>> NUMANode >>>>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>>> socket >>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] >>>>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], >>>> socket >>>>>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so >>>>>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] >>>>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], >>>> socket >>>>>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so >>>>>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] >>>>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], >>>> socket >>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] >>>>>> Hello world from process 1 of 4 >>>>>> Hello world from process 0 of 4 >>>>>> Hello world from process 3 of 4 >>>>>> Hello world from process 2 of 4 >>>>>> >>>>>> >>>>>> 3) I use PGI compiler. 
It cannot accept the compiler switch >>>>>> "-Wno-variadic-macros", which is >>>>>> included in the configure script. >>>>>> >>>>>> btl_usnic_CFLAGS="-Wno-variadic-macros" >>>>>> >>>>>> I removed this switch, and then I could continue building 1.7.4rc1. >>>>>> >>>>>> Regards, >>>>>> Tetsuya Mishima >>>>>> >>>>>> >>>>>>> Hmmm...okay, I understand the scenario. Must be something in the algo >>>>>> when it only has one node, so it shouldn't be too hard to track down. >>>>>>> >>>>>>> I'm off on travel for a few days, but will return to this when I get >>>>>> back. >>>>>>> >>>>>>> Sorry for the delay - will try to look at this while I'm gone, but can't >>>>>> promise anything :-( >>>>>>> >>>>>>> >>>>>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi Ralph, sorry for the confusion. >>>>>>>> >>>>>>>> We usually log on to "manage", which is our control node. >>>>>>>> From manage, we submit jobs or enter a remote node such as >>>>>>>> node03 via Torque's interactive mode (qsub -I). >>>>>>>> >>>>>>>> At that time, instead of Torque, I just did rsh to node03 from manage >>>>>>>> and ran myprog on the node. I hope you can understand what I did. 
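For context on the repeated "binding more processes than cpus" message in this thread: the mapper refuses to bind when the cpus a node would need (procs on the node times cpus-per-proc) exceed the cpus the allocation actually granted there, unless overload is explicitly allowed. A rough sketch of that arithmetic in Python (a hypothetical illustration of the guard, not ORTE's actual bind_downwards code):

```python
def overload_check(nprocs_on_node, cpus_per_proc, cpus_on_node, overload_allowed=False):
    """Return True when the binding request fits on the node.

    Mimics, in spirit, the guard behind the message 'binding more
    processes than cpus on a resource' -- a hypothetical helper,
    not Open MPI's actual mapper code.
    """
    cpus_needed = nprocs_on_node * cpus_per_proc
    return overload_allowed or cpus_needed <= cpus_on_node

# node12 with a full 8-core allocation: 2 procs x 4 cpus fit
assert overload_check(2, 4, 8)
# node12 when the nodefile only granted 7 slots: 8 cpus needed > 7 -> error
assert not overload_check(2, 4, 7)
# the "-bind-to core:overload-allowed" escape hatch skips the guard
assert overload_check(2, 4, 7, overload_allowed=True)
```

This is consistent with Ralph's reading of the first report (7 slots on node12, 8 cpus requested), while the later reports show the error even when the allocation looks sufficient.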
>>>>>>>> >>>>>>>> Now, I retried with "-host node03", which still causes the problem: >>>>>>>> (I confirmed that a local run on manage caused the same problem too) >>>>>>>> >>>>>>>> [mishima@manage ~]$ rsh node03 >>>>>>>> Last login: Wed Dec 11 11:38:57 from manage >>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>> [mishima@node03 demos]$ >>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings >>>>>>>> -cpus-per-proc 4 -map-by socket myprog >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> A request was made to bind to that would result in binding more >>>>>>>> processes than cpus on a resource: >>>>>>>> >>>>>>>> Bind to: CORE >>>>>>>> Node: node03 >>>>>>>> #processes: 2 >>>>>>>> #cpus: 1 >>>>>>>> >>>>>>>> You can override this protection by adding the "overload-allowed" >>>>>>>> option to your binding directive. >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> It's strange, but I have to report that "-map-by socket:span" worked well. >>>>>>>> >>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings >>>>>>>> -cpus-per-proc 4 -map-by socket:span myprog >>>>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.] 
>>>>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt >> 0]], >>>>>> socket >>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>> >> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt >> 0]], >>>>>> socket >>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>> >> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt >> 0]], >>>>>> socket >>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>> >> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt >> 0]], >>>>>> socket >>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>> >> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt > 0]], >>>>>> socket >>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>> >> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt > 0]], >>>>>> socket >>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>> >> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>>>>>>> Hello world from process 2 of 8 >>>>>>>> Hello world from process 6 of 8 >>>>>>>> Hello world from process 3 of 8 >>>>>>>> Hello world from process 7 of 8 >>>>>>>> Hello world from process 1 of 8 >>>>>>>> Hello world from process 5 of 8 >>>>>>>> Hello world from process 0 of 8 >>>>>>>> Hello world from process 4 of 8 >>>>>>>> >>>>>>>> Regards, >>>>>>>> Tetsuya Mishima >>>>>>>> >>>>>>>> >>>>>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Hi Ralph, >>>>>>>>>> >>>>>>>>>> I tried again with -cpus-per-proc 2 as shown below. >>>>>>>>>> Here, I found that "-map-by socket:span" worked well. >>>>>>>>>> >>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>> -cpus-per-proc >>>>>> 2 >>>>>>>>>> -map-by socket:span myprog >>>>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>>>>>>>>> /./././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 2[core 17[hwt 0]]: [./././././././.][./././. >>>>>>>>>> /./././.][B/B/./././././.][./././././././.] >>>>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 2[core 19[hwt 0]]: [./././././././.][./././. >>>>>>>>>> /./././.][././B/B/./././.][./././././././.] >>>>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 3[core 25[hwt 0]]: [./././././././.][./././. >>>>>>>>>> /./././.][./././././././.][B/B/./././././.] 
>>>>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 3[core 27[hwt 0]]: [./././././././.][./././. >>>>>>>>>> /./././.][./././././././.][././B/B/./././.] >>>>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>> Hello world from process 5 of 8> >>>>>>> Hello world from > process 3 of 8 >>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>> -cpus-per-proc >>>>>> 2 >>>>>>>>>> -map-by socket myprog >>>>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././. >>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 0[core 7[hwt 0]]: [././././././B/B][././././. >>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>>>>>>>>> /./././.][./././././././.][./././././././.] 
>>>>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>> >>>>>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets. >>>>>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket" have the >>>>>>>> same >>>>>>>>>> meaning. >>>>>>>>>> Therefore, there's no problem there. Sorry for disturbing you. >>>>>>>>> >>>>>>>>> No problem - glad you could clear that up :-) >>>>>>>>> >>>>>>>>>> >>>>>>>>>> By the way, through this test, I found another problem. 
>>>>>>>>>> Without the Torque manager and just using rsh, it causes the same error >>>>>>>> like >>>>>>>>>> below: >>>>>>>>>> >>>>>>>>>> [mishima@manage openmpi-1.7]$ rsh node03 >>>>>>>>>> Last login: Wed Dec 11 09:42:02 from manage >>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 >>>>>>>>>> -map-by socket myprog >>>>>>>>> >>>>>>>>> I don't understand the difference here - you are simply starting it from >>>>>>>>> a different node? It looks like everything is expected to run local to >>>>>>>> mpirun, yes? So there is no rsh actually involved here. >>>>>>>>> Are you still running in an allocation? >>>>>>>>> >>>>>>>>> If you run this with "-host node03" on the cmd line, do you see the same >>>>>>>> problem? >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> A request was made to bind to that would result in binding more >>>>>>>>>> processes than cpus on a resource: >>>>>>>>>> >>>>>>>>>> Bind to: CORE >>>>>>>>>> Node: node03 >>>>>>>>>> #processes: 2 >>>>>>>>>> #cpus: 1 >>>>>>>>>> >>>>>>>>>> You can override this protection by adding the "overload-allowed" >>>>>>>>>> option to your binding directive. >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> [mishima@node03 demos]$ >>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 >>>>>>>>>> myprog >>>>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] 
>>>>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>> >>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>> >>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>> >>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>> socket 3[core 27[hwt 0]]:>>>>> >>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt >>>> 0]], >>>>>>>> socket >>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>> >>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>> >>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt >> 0]], >>>>>>>> socket >>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>> >>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Tetsuya Mishima >>>>>>>>>> >>>>>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but >> let >>>>>> me >>>>>>>>>> poke around a bit and see what might be happening. >>>>>>>>>>> >>>>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>> >>>>>>>>>>>> Thanks. I didn't know the meaning of "socket:span". >>>>>>>>>>>> >>>>>>>>>>>> But it still causes the problem, which seems socket:span >> doesn't >>>>>>>> work. >>>>>>>>>>>> >>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32 >>>>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start >>>>>>>>>>>> qsub: job 8265.manage.cluster ready >>>>>>>>>>>> >>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>>>> -cpus-per-proc >>>>>>>> 4 >>>>>>>>>>>> -map-by socket:span myprog >>>>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8 > [hwt >>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>>>>> >>>>>> > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12 > [hwt >>>>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>>> >>>>>> > [./././././././.][././././B/B/B/B][./././././././.][./././././././.] 
>>>>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16 > [hwt >>>>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>>> >>>>>> > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20 > [hwt >>>>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>>> >>>>>> > [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24 > [hwt >>>>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>>>>> >>>>>> > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28 > [hwt >>>>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>>> >>>>>> > [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0 > [hwt >>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>>> >>>>>> > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4 > [hwt >>>> 0]], >>>>>>>>>> socket >>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>>>> >>>>>> > [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>>> >>>>>>>>>>>>> No, that is actually correct. We map a socket until full, then move to >>>>>>>>>>>> the next. What you want is --map-by socket:span >>>>>>>>>>>>> >>>>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646. >>>>>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket" itself didn't >>>>>>>>>>>>>> work well, as shown below: >>>>>>>>>>>>>> >>>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32 >>>>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start >>>>>>>>>>>>>> qsub: job 8260.manage.cluster ready >>>>>>>>>>>>>> >>>>>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 >>>>>>>>>>>>>> -map-by socket myprog >>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] 
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12 >> [hwt >>>>>>>> 0]], >>>>>>>>>>>> socket >>>>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>>>>> >>>>>>>> >> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16 >> [hwt >>>>>>>> 0]], >>>>>>>>>>>> socket >>>>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>>>>> >>>>>>>> >> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20 >> [hwt >>>>>>>> 0]], >>>>>>>>>>>> socket >>>>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>>>>> >>>>>>>> >> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24 >> [hwt >>>>>>>> 0]], >>>>>>>>>>>> socket >>>>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>>>>>>> >>>>>>>> >> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28 >> [hwt >>>>>>>> 0]], >>>>>>>>>>>> socket >>>>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>>>>> >>>>>>>> >> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0 >> [hwt >>>>>> 0]], >>>>>>>>>>>> socket >>>>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>>>>> >>>>>>>> >> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] 
>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think this should be like this: >>>>>>>>>>>>>> >>>>>>>>>>>>>> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>>> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>>>>> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>>>> ... >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and have >>>>>>>>>> scheduled it for 1.7.4. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>> Ralph >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thank you very much for your quick response. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm afraid to say that I found one more issue... >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It's not so serious. Please check it when you have time. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under the Torque >>>>>>>>>>>> manager. 
> It doesn't work as shown below. I guess you can get the same behaviour
> under the Slurm manager.
>
> Of course, if I remove the -map-by option, it works quite well.
>
> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> qsub: waiting for job 8116.manage.cluster to start
> qsub: job 8116.manage.cluster ready
>
> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        node03
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]],
> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>    [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]],
> socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>    [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]],
> socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>    [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]],
> socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>    [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]],
> socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>    [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]],
> socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>    [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]],
> socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>    [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]],
> socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>    [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>
> Regards,
> Tetsuya Mishima
>
> On Nov 18, 2013, Ralph Castain wrote:
>
> Fixed and scheduled to move to 1.7.4. Thanks again!
>
> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Thanks! That's precisely where I was going to look when I had time :-)
>
> I'll update tomorrow.
> Ralph
>
> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>
> Hi Ralph,
>
> This is the continuous story of "Segmentation fault in oob_tcp.c of
> openmpi-1.7.4a1r29646".
>
> I found the cause.
> Firstly, I noticed that your hostfile can work and mine can not.
>
> Your host file:
> cat hosts
> bend001 slots=12
>
> My host file:
> cat hosts
> node08
> node08
> ...(total 8 lines)
>
> I modified my script file to add "slots=1" to each line of my hostfile
> just before launching mpirun. Then it worked.
>
> My host file (modified):
> cat hosts
> node08 slots=1
> node08 slots=1
> ...(total 8 lines)
>
> Secondly, I confirmed that there's a slight difference between
> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>
> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> 394,401c394,399
> <         if (got_count) {
> <             node->slots_given = true;
> <         } else if (got_max) {
> <             node->slots = node->slots_max;
> <             node->slots_given = true;
> <         } else {
> <             /* should be set by obj_new, but just to be clear */
> <             node->slots_given = false;
> ---
> >         if (!got_count) {
> >             if (got_max) {
> >                 node->slots = node->slots_max;
> >             } else {
> >                 ++node->slots;
> >             }
> ....
>
> Finally, I added the line 402 below just as a tentative trial.
> Then, it worked.
>
> cat -n orte/util/hostfile/hostfile.c:
> ...
>    394          if (got_count) {
>    395              node->slots_given = true;
>    396          } else if (got_max) {
>    397              node->slots = node->slots_max;
>    398              node->slots_given = true;
>    399          } else {
>    400              /* should be set by obj_new, but just to be clear */
>    401              node->slots_given = false;
>    402              ++node->slots;  /* added by tmishima */
>    403          }
> ...
>
> Please fix the problem properly, because it's just based on my
> random guess. It's related to the treatment of a hostfile where slots
> information is not given.
>
> Regards,
> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
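To make the quoted hostfile issue concrete: when a host is listed on several lines with no "slots=" clause, each repeated line is supposed to add one slot, which is what the `++node->slots` default in the 1.7.3 code (and tmishima's line 402) provides. The sketch below is a hypothetical, simplified parser written for illustration only; it is not the actual ORTE hostfile code, just a model of the slot-counting behaviour being discussed.

```python
# Hypothetical sketch (NOT the real orte/util/hostfile parser): model of
# how per-host slot counts should accumulate from a hostfile.

def parse_hostfile(lines):
    """Return {hostname: slots}, one entry per unique host."""
    slots = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        host = parts[0]
        given = None
        for tok in parts[1:]:
            if tok.startswith("slots="):
                given = int(tok.split("=", 1)[1])
        if given is not None:
            # explicit count on the line: use it for this host
            slots[host] = given
        else:
            # no slots info: each repeated line adds one slot,
            # mirroring the ++node->slots default discussed above
            slots[host] = slots.get(host, 0) + 1
    return slots

print(parse_hostfile(["node08"] * 8))        # {'node08': 8}
print(parse_hostfile(["bend001 slots=12"]))  # {'bend001': 12}
```

Under this model, tmishima's eight bare "node08" lines yield 8 slots; the 1.7.4a1 code that set `slots_given = false` without incrementing left the count at zero, which is consistent with the failure he saw.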
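As a closing illustration of the "rank 00 / rank 01 / rank 02" layout tmishima expected from `-np 8 -cpus-per-proc 4 -map-by socket` on a 4-socket, 8-core-per-socket node: ranks should be dealt to sockets round-robin, each taking 4 cores. This is a sketch of that expected placement, not of mpirun's actual mapper.

```python
# Sketch of the expected round-robin-by-socket placement for
# "-np 8 -cpus-per-proc 4 -map-by socket" on 4 sockets x 8 cores.

SOCKETS, CORES_PER_SOCKET, CPUS_PER_PROC, NPROCS = 4, 8, 4, 8

def expected_binding(rank):
    """Return the global core indices the rank should occupy."""
    socket = rank % SOCKETS        # deal ranks to sockets round-robin
    pass_no = rank // SOCKETS      # second pass uses each socket's upper cores
    start = socket * CORES_PER_SOCKET + pass_no * CPUS_PER_PROC
    return list(range(start, start + CPUS_PER_PROC))

def render(cores):
    """Draw the [B/B/./...] map in the style of -report-bindings output."""
    marks = ["B" if c in cores else "." for c in range(SOCKETS * CORES_PER_SOCKET)]
    return "".join("[" + "/".join(marks[s * 8:(s + 1) * 8]) + "]"
                   for s in range(SOCKETS))

for r in range(NPROCS):
    print(f"rank {r:02d}  {render(expected_binding(r))}")
```

Rank 0 lands on socket 0, rank 1 on socket 1, and so on, matching the expected layout quoted in the thread rather than the observed behaviour, which filled each socket with two ranks before moving on.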