Re: [OMPI devel] mpirun error when not using span

2018-09-11 Thread Ralph H Castain
I believe the problem is actually a little different from what you described. The 
issue occurs whenever the #procs multiplied by PE exceeds the number of cores on a 
node. It is caused by the fact that we aren’t considering the PE value when mapping 
processes - we only appear to look at it when binding. I’ll try to poke at it a bit.
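
To put rough numbers on it (assuming the 36-core nodes implied by the 
report-bindings output below, i.e. two 18-core sockets per node):

   16 procs x PE=4  = 64 cores requested
   36 cores / PE=4  =  9 procs per node without overloading

So the mapper, counting only the 16 procs and not the 4 cores each one needs, 
sees no reason to involve the second node, and the overload only surfaces later 
at binding time.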


> On Sep 11, 2018, at 9:17 AM, Shrader, David Lee wrote:
> 
> Here's the xml output from lstopo. Thank you for taking a look!
> David
> 
> From: devel on behalf of Ralph H Castain
> Sent: Monday, September 10, 2018 5:12 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] mpirun error when not using span
>  
> Could you please send the output from “lstopo --of xml foo.xml” (the file 
> foo.xml) so I can try to replicate here?
> 
> 
>> On Sep 4, 2018, at 12:35 PM, Shrader, David Lee <dshra...@lanl.gov> wrote:
>> 
>> Hello,
>> 
>> I have run this issue by Howard, and he asked me to forward it on to the 
>> Open MPI devel mailing list. I get an error when trying to use PE=n with 
>> '--map-by numa' without span when running on more than one node:
>> 
>> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to 
>> core --report-bindings true
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>> 
>>    Bind to:     CORE
>>    Node:        ba001
>>    #processes:  2
>>    #cpus:       1
>> 
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --------------------------------------------------------------------------
>> 
>> The absolute values of the numbers passed to -n and PE don't really matter; 
>> the error pops up as soon as those numbers are combined in such a way that 
>> an MPI rank ends up on the second node.
>> 
>> If I add the "span" parameter, everything works as expected:
>> 
>> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span 
>> --bind-to core --report-bindings true
>> [ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], 
>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
>> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
>> [ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], 
>> socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
>> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
>> [ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], 
>> socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
>> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
>> [ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], 
>> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 
>> 0]]: 
>> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
>> [ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], 
>> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 
>> 0]]: 
>> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
>> [ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], 
>> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 
>> 0]]: 
>> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
>> [ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], 
>> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 
>> 0]]: 
>> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
>> [ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], 
>> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 
>> 0]]: 
>> [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
>> [ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], 
>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
>> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
>> [ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], 
>> socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
>> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
>> [ba001.localdomain:11700] MCW rank 2 bound 

Re: [OMPI devel] mpirun error when not using span

2018-09-11 Thread Shrader, David Lee
Here's the xml output from lstopo. Thank you for taking a look!

David


From: devel on behalf of Ralph H Castain
Sent: Monday, September 10, 2018 5:12 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] mpirun error when not using span

Could you please send the output from “lstopo --of xml foo.xml” (the file 
foo.xml) so I can try to replicate here?


On Sep 4, 2018, at 12:35 PM, Shrader, David Lee <dshra...@lanl.gov> wrote:

Hello,

I have run this issue by Howard, and he asked me to forward it on to the Open 
MPI devel mailing list. I get an error when trying to use PE=n with '--map-by 
numa' without span when running on more than one node:


[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core 
--report-bindings true
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        ba001
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

The absolute values of the numbers passed to -n and PE don't really matter; the 
error pops up as soon as those numbers are combined in such a way that an MPI 
rank ends up on the second node.

If I add the "span" parameter, everything works as expected:


[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span --bind-to 
core --report-bindings true
[ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
[B/B/B/B/./././././././././././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
[././././B/B/B/B/./././././././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], socket 
0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
[././././././././B/B/B/B/./././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], socket 
0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
[././././././././././././B/B/B/B/./.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], socket 
1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
[./././././././././././././././././.][B/B/B/B/./././././././././././././.]
[ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], socket 
1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
[./././././././././././././././././.][././././B/B/B/B/./././././././././.]
[ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], socket 
1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
[./././././././././././././././././.][././././././././B/B/B/B/./././././.]
[ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], socket 
1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
[./././././././././././././././././.][././././././././././././B/B/B/B/./.]
[ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
[B/B/B/B/./././././././././././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
[././././B/B/B/B/./././././././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket 
0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
[././././././././B/B/B/B/./././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]], socket 
0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
[././././././././././././B/B/B/B/./.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]], socket 
1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
[./././././././././././././././././.][B/B/B/B/./././././././././././././.]
[ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]], socket 
1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
[./././././././././././././././././.][././././B/B/B/B/./././././././././.]
[ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]], socket 
1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
[./././././././././././././././././

Re: [OMPI devel] mpirun error when not using span

2018-09-10 Thread Ralph H Castain
Could you please send the output from “lstopo --of xml foo.xml” (the file 
foo.xml) so I can try to replicate here?


> On Sep 4, 2018, at 12:35 PM, Shrader, David Lee wrote:
> 
> Hello,
> 
> I have run this issue by Howard, and he asked me to forward it on to the Open 
> MPI devel mailing list. I get an error when trying to use PE=n with '--map-by 
> numa' without span when running on more than one node:
> 
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to 
> core --report-bindings true
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>    Bind to:     CORE
>    Node:        ba001
>    #processes:  2
>    #cpus:       1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
> 
> The absolute values of the numbers passed to -n and PE don't really matter; 
> the error pops up as soon as those numbers are combined in such a way that an 
> MPI rank ends up on the second node.
> 
> If I add the "span" parameter, everything works as expected:
> 
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span 
> --bind-to core --report-bindings true
> [ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], 
> socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], 
> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], 
> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], 
> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], 
> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], 
> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
> [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
> [ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket 
> 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]], 
> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]], 
> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]], 
> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]], 
> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba001.localdomain:11700] MCW rank 7 bound to socket 1[core 30[hwt 0]], 
> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
> [.

[OMPI devel] mpirun error when not using span

2018-09-04 Thread Shrader, David Lee
Hello,


I have run this issue by Howard, and he asked me to forward it on to the Open 
MPI devel mailing list. I get an error when trying to use PE=n with '--map-by 
numa' without span when running on more than one node:


[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core 
--report-bindings true
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        ba001
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

The absolute values of the numbers passed to -n and PE don't really matter; the 
error pops up as soon as those numbers are combined in such a way that an MPI 
rank ends up on the second node.
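
For scale, going by the report-bindings output below (two 18-core sockets per 
node):

   16 procs x PE=4 = 64 cores needed  vs.  36 cores on ba001

so this combination has to spill onto the second node, while smaller 
combinations that stay within 36 cores run fine.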


If I add the "span" parameter, everything works as expected:


[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span --bind-to 
core --report-bindings true
[ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
[B/B/B/B/./././././././././././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
[././././B/B/B/B/./././././././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], socket 
0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
[././././././././B/B/B/B/./././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], socket 
0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
[././././././././././././B/B/B/B/./.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], socket 
1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
[./././././././././././././././././.][B/B/B/B/./././././././././././././.]
[ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], socket 
1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
[./././././././././././././././././.][././././B/B/B/B/./././././././././.]
[ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], socket 
1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
[./././././././././././././././././.][././././././././B/B/B/B/./././././.]
[ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], socket 
1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
[./././././././././././././././././.][././././././././././././B/B/B/B/./.]
[ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
[B/B/B/B/./././././././././././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
[././././B/B/B/B/./././././././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket 
0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
[././././././././B/B/B/B/./././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]], socket 
0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
[././././././././././././B/B/B/B/./.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]], socket 
1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
[./././././././././././././././././.][B/B/B/B/./././././././././././././.]
[ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]], socket 
1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
[./././././././././././././././././.][././././B/B/B/B/./././././././././.]
[ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]], socket 
1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
[./././././././././././././././././.][././././././././B/B/B/B/./././././.]
[ba001.localdomain:11700] MCW rank 7 bound to socket 1[core 30[hwt 0]], socket 
1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
[./././././././././././././././././.][././././././././././././B/B/B/B/./.]


I would have expected the first command to work in the sense that processes are 
at least mapped and bound somewhere across the two nodes; is there a particular 
reason why that doesn't happen?
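
I realize I could silence the check by adding the suggested "overload-allowed" 
qualifier to the binding directive, presumably something like

[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core:overload-allowed --report-bindings true

but as I understand it that would just oversubscribe the cores on ba001 rather 
than use ba002, which isn't what I'm after.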


I am using Open MPI 3.1.2 in the above examples with only