Re: [OMPI devel] mpirun error when not using span
I believe the problem is actually a little different than you described. The issue occurs whenever the number of procs multiplied by PE exceeds the number of cores on a node. It is caused by the fact that we aren’t considering the PE number when mapping processes - we only appear to be looking at it when binding. I’ll try to poke at it a bit.

> On Sep 11, 2018, at 9:17 AM, Shrader, David Lee wrote:
>
> Here's the xml output from lstopo. Thank you for taking a look!
> David
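To put rough numbers on that diagnosis (an editorial sketch, not code from Open MPI): the report-bindings output further down shows two 18-core sockets per node, i.e. 36 cores per node, so a quick shell check of the example sizes used in this thread looks like:

# Illustrative arithmetic only; 36 cores/node is read off the binding masks below.
ranks=16; pe=4; cores_per_node=36
echo "cpus requested       : $(( ranks * pe ))"          # 64
echo "cpus on one node     : ${cores_per_node}"          # 36
echo "ranks per node, PE=4 : $(( cores_per_node / pe ))" # 9, so 16 ranks need a 2nd node

When PE is honored at mapping time, at most 9 of the 16 ranks fit on the first node and the rest must go to the second node; ignoring PE at that stage defers the conflict to bind time, which is where the error below appears.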
Re: [OMPI devel] mpirun error when not using span
Here's the xml output from lstopo. Thank you for taking a look!
David

From: devel on behalf of Ralph H Castain
Sent: Monday, September 10, 2018 5:12 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] mpirun error when not using span

Could you please send the output from “lstopo --of xml foo.xml” (the file foo.xml) so I can try to replicate here?
Re: [OMPI devel] mpirun error when not using span
Could you please send the output from “lstopo --of xml foo.xml” (the file foo.xml) so I can try to replicate here?

> On Sep 4, 2018, at 12:35 PM, Shrader, David Lee wrote:
>
> Hello,
>
> I have run this issue by Howard, and he asked me to forward it on to the Open MPI devel mailing list. I get an error when trying to use PE=n with '--map-by numa' and not using span when using more than one node:
>
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core --report-bindings true
[OMPI devel] mpirun error when not using span
Hello,

I have run this issue by Howard, and he asked me to forward it on to the Open MPI devel mailing list. I get an error when trying to use PE=n with '--map-by numa' and not using span when using more than one node:

[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core --report-bindings true
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        ba001
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

The absolute values of the numbers passed to -n and PE don't really matter; the error pops up as soon as those numbers are combined in such a way that an MPI rank ends up on the second node.

If I add the "span" parameter, everything works as expected:

[dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span --bind-to core --report-bindings true
[ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
[ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
[ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
[ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
[ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
[ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
[ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]], socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
[ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]], socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
[ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]], socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
[ba001.localdomain:11700] MCW rank 7 bound to socket 1[core 30[hwt 0]], socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: [./././././././././././././././././.][././././././././././././B/B/B/B/./.]

I would have expected the first command to work in the sense that processes are at least mapped and bound somewhere across the two nodes; is there a particular reason why that doesn't happen? I am using Open MPI 3.1.2 in the above examples with only
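As a possible stopgap (an untested sketch based only on the "overload-allowed" hint in the error text above, not something verified in this thread), the qualifier can presumably be appended to the binding directive; note that this relaxes the bind-time check rather than fixing the mapping:

# Untested: allow more processes than cpus on the overloaded resource.
mpirun -n 16 --map-by numa:PE=4 --bind-to core:overload-allowed --report-bindings true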