Re: [OMPI users] local rank to rank comms

2019-03-20 Thread Michael Di Domenico
unfortunately it takes a while to export the data, but here's what i see

On Mon, Mar 11, 2019 at 11:02 PM Gilles Gouaillardet  wrote:
>
> Michael,
>
>
> this is odd, I will have a look.
>
> Can you confirm you are running on a single node ?
>
>
> First, you need to understand which component Open MPI uses for
> communications.
>
> There are several options here, and since I do not know how Open MPI was
> built nor which dependencies are installed, I can only list a few:
>
>
> - pml/cm uses mtl/psm2 => omnipath is used for both inter and intra node
> communications
>
> - pml/cm uses mtl/ofi => libfabric is used for both inter and intra node
> communications. it definitely uses libpsm2 for inter node
> communications, and I do not know enough about the internals to tell how
> intra node communications are handled
>
> - pml/ob1 is used; I guess it uses btl/ofi for inter node communications
> and btl/vader for intra node communications (in that case the NIC device
> is not used for intra node communications)
>
> there could be others I am missing (does UCX support OmniPath? could
> btl/ofi also be used for intra node communications?)
>
>
> mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca
> mtl_base_verbose 10 ...
>
> should tell you what is used (feel free to compress and post the full
> output if you have a hard time understanding the logs)
>
>
> Cheers,
>
>
> Gilles
>
> On 3/12/2019 1:41 AM, Michael Di Domenico wrote:
> > On Mon, Mar 11, 2019 at 12:09 PM Gilles Gouaillardet
> >  wrote:
> >> You can force
> >> mpirun --mca pml ob1 ...
> >> And btl/vader (shared memory) will be used for intra node communications 
> >> ... unless MPI tasks are from different jobs (read MPI_Comm_spawn())
> > if i run
> >
> > mpirun -n 16 IMB-MPI1 alltoallv
> > things run fine, 12us on average for all ranks
> >
> > if i run
> >
> > mpirun -n 16 --mca pml ob1 IMB-MPI1 alltoallv
> > the program runs, but then it hangs at "List of benchmarks to run:
> > #Alltoallv" and no tests run


[Attachment: ompi.run.ob1]
[Attachment: ompi.run.cm]

Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Gilles Gouaillardet

Michael,


this is odd, I will have a look.

Can you confirm you are running on a single node ?


First, you need to understand which component Open MPI uses for
communications.


There are several options here, and since I do not know how Open MPI was
built nor which dependencies are installed, I can only list a few:


- pml/cm uses mtl/psm2 => omnipath is used for both inter and intra node 
communications


- pml/cm uses mtl/ofi => libfabric is used for both inter and intra node
communications. it definitely uses libpsm2 for inter node
communications, and I do not know enough about the internals to tell how
intra node communications are handled


- pml/ob1 is used; I guess it uses btl/ofi for inter node communications
and btl/vader for intra node communications (in that case the NIC device
is not used for intra node communications)


there could be others I am missing (does UCX support OmniPath? could
btl/ofi also be used for intra node communications?)
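
(a quick way to check which of these components your build actually
contains, assuming ompi_info is in your PATH:

ompi_info | grep -E "MCA (pml|mtl|btl):"

that only lists what was compiled in; the verbose run below shows what
is actually selected)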



mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca 
mtl_base_verbose 10 ...


should tell you what is used (feel free to compress and post the full
output if you have a hard time understanding the logs)



Cheers,


Gilles

On 3/12/2019 1:41 AM, Michael Di Domenico wrote:
> On Mon, Mar 11, 2019 at 12:09 PM Gilles Gouaillardet
>  wrote:
> > You can force
> > mpirun --mca pml ob1 ...
> > And btl/vader (shared memory) will be used for intra node communications ...
> > unless MPI tasks are from different jobs (read MPI_Comm_spawn())
>
> if i run
>
> mpirun -n 16 IMB-MPI1 alltoallv
> things run fine, 12us on average for all ranks
>
> if i run
>
> mpirun -n 16 --mca pml ob1 IMB-MPI1 alltoallv
> the program runs, but then it hangs at "List of benchmarks to run:
> #Alltoallv" and no tests run


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
On Mon, Mar 11, 2019 at 12:09 PM Gilles Gouaillardet
 wrote:
> You can force
> mpirun --mca pml ob1 ...
> And btl/vader (shared memory) will be used for intra node communications ... 
> unless MPI tasks are from different jobs (read MPI_Comm_spawn())

if i run

mpirun -n 16 IMB-MPI1 alltoallv
things run fine, 12us on average for all ranks

if i run

mpirun -n 16 --mca pml ob1 IMB-MPI1 alltoallv
the program runs, but then it hangs at "List of benchmarks to run:
#Alltoallv" and no tests run
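
two things worth trying here, as a guess rather than a known fix: name
the btls explicitly, since ob1 wants the self btl alongside vader, and
pull a stack trace from each hung rank (assuming pstack is installed)

mpirun -n 16 --mca pml ob1 --mca btl vader,self IMB-MPI1 alltoallv
# see where each rank is stuck
for pid in $(pgrep IMB-MPI1); do pstack $pid; done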


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
On Mon, Mar 11, 2019 at 12:19 PM Ralph H Castain  wrote:
> OFI uses libpsm2 underneath when omnipath is detected
>
> > On Mar 11, 2019, at 9:06 AM, Gilles Gouaillardet 
> >  wrote:
> > It might show that pml/cm and mtl/psm2 are used. If so, then yes,
> > the OmniPath library is used even for intra node communications. If this
> > library is optimized for intra node, it will internally use shared
> > memory instead of the NIC.

would it be fair to assume that, if the opa library is optimized for
intra-node shared memory, there shouldn't be much of a difference
between the opa library and the ompi library for local rank to rank
comms?

is there a way or tool to measure that?  i'd like to run the tests
toggling the opa vs ompi libraries and see whether, and how much of, a
difference there is
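
one way to put a number on it, as a sketch (assuming IMB is handy): pin
two ranks to cores on the same node and compare PingPong latency on the
default path against the forced shared-memory path

# default path (whatever pml/mtl gets selected)
mpirun -n 2 --map-by core --bind-to core IMB-MPI1 PingPong
# forced shared memory (pml/ob1 with btl/vader)
mpirun -n 2 --map-by core --bind-to core --mca pml ob1 IMB-MPI1 PingPong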


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Ralph H Castain
OFI uses libpsm2 underneath when omnipath is detected


> On Mar 11, 2019, at 9:06 AM, Gilles Gouaillardet 
>  wrote:
> 
> Michael,
> 
> You can
> 
> mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca 
> mtl_base_verbose 10 ...
> 
> > It might show that pml/cm and mtl/psm2 are used. If so, then yes, the
> > OmniPath library is used even for intra node communications. If this library
> > is optimized for intra node, it will internally use shared memory
> > instead of the NIC.
> 
> 
> You can force
> 
> mpirun --mca pml ob1 ...
> 
> 
> And btl/vader (shared memory) will be used for intra node communications ... 
> unless MPI tasks are from different jobs (read MPI_Comm_spawn())
> 
> Cheers,
> 
> Gilles
> 
> Michael Di Domenico  wrote:
>> i have a user that's claiming when two ranks on the same node want to
>> talk with each other, they're using the NIC to talk rather than just
>> talking directly.
>> 
>> i've never had to test such a scenario.  is there a way for me to
>> prove one way or another whether two ranks are talking through, say, the
>> kernel (or however it actually works) or using the nic?
>> 
>> i didn't set any flags when i compiled openmpi to change this.
>> 
>> i'm running ompi 3.1, pmix 2.2.1, and slurm 18.05 running atop omnipath


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Gilles Gouaillardet
Michael,

You can

mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca 
mtl_base_verbose 10 ...

It might show that pml/cm and mtl/psm2 are used. If so, then yes, the
OmniPath library is used even for intra node communications. If this library is
optimized for intra node, it will internally use shared memory instead of
the NIC.
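
As a side note, the PSM2 documentation describes a PSM2_DEVICES
environment variable; I have not tested it here, but restricting it to
self and shm should keep PSM2 intra node traffic in shared memory and
off the HFI:

# untested sketch, per the PSM2 docs
export PSM2_DEVICES=self,shm
mpirun -n 16 IMB-MPI1 alltoallv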


You can force

mpirun --mca pml ob1 ...


And btl/vader (shared memory) will be used for intra node communications ... 
unless MPI tasks are from different jobs (read MPI_Comm_spawn())

Cheers,

Gilles

Michael Di Domenico  wrote:
>i have a user that's claiming when two ranks on the same node want to
>talk with each other, they're using the NIC to talk rather than just
>talking directly.
>
>i've never had to test such a scenario.  is there a way for me to
>prove one way or another whether two ranks are talking through, say, the
>kernel (or however it actually works) or using the nic?
>
>i didn't set any flags when i compiled openmpi to change this.
>
>i'm running ompi 3.1, pmix 2.2.1, and slurm 18.05 running atop omnipath


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
On Mon, Mar 11, 2019 at 11:51 AM Ralph H Castain  wrote:
> You are probably using the ofi mtl - could be that psm2 uses a loopback method?

according to ompi_info i do in fact have the mtls ofi, psm, and psm2.  i
haven't changed any of the defaults, so are you saying that in order to
change the behaviour i have to run mpirun --mca mtl psm2?  if so, what's
the recourse to not using the ofi mtl?
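
from the mca docs i gather both directions can be expressed with the
standard selection syntax (the ^ prefix excludes a component), though i
haven't tried either yet:

# force the psm2 mtl (via pml/cm)
mpirun --mca pml cm --mca mtl psm2 ...
# or just exclude the ofi mtl
mpirun --mca mtl ^ofi ...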


Re: [OMPI users] local rank to rank comms

2019-03-11 Thread Ralph H Castain
You are probably using the ofi mtl - could be that psm2 uses a loopback method?


> On Mar 11, 2019, at 8:40 AM, Michael Di Domenico  
> wrote:
> 
> i have a user that's claiming when two ranks on the same node want to
> talk with each other, they're using the NIC to talk rather than just
> talking directly.
> 
> i've never had to test such a scenario.  is there a way for me to
> prove one way or another whether two ranks are talking through, say, the
> kernel (or however it actually works) or using the nic?
> 
> i didn't set any flags when i compiled openmpi to change this.
> 
> i'm running ompi 3.1, pmix 2.2.1, and slurm 18.05 running atop omnipath


[OMPI users] local rank to rank comms

2019-03-11 Thread Michael Di Domenico
i have a user that's claiming when two ranks on the same node want to
talk with each other, they're using the NIC to talk rather than just
talking directly.

i've never had to test such a scenario.  is there a way for me to
prove one way or another whether two ranks are talking through, say, the
kernel (or however it actually works) or using the nic?

i didn't set any flags when i compiled openmpi to change this.

i'm running ompi 3.1, pmix 2.2.1, and slurm 18.05 running atop omnipath
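
the closest check i can think of, assuming the hfi1 driver exposes the
standard rdma port counters (and hfi1_0 is a guess for the first
omnipath device): read the port transmit counter, run a purely
intra-node job, and read it again

cat /sys/class/infiniband/hfi1_0/ports/1/counters/port_xmit_data
mpirun -n 2 IMB-MPI1 PingPong
cat /sys/class/infiniband/hfi1_0/ports/1/counters/port_xmit_data
# if the counter barely moves, the ranks were not using the nic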