Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-12 Thread Bertini, Denis Dr. via users
Hi George,

Maybe I did not explain properly the point I wanted to have clarified.

What I was trying to say is that I had the impression that MPI_T was
developed for tuning the internal MPI parameters for a given underlying
network fabric.

Furthermore, it is mainly used by MPI developers for debugging /
improvement, right?

Or is it meant for users too?

If that is the case, how can I, as a user, make use of the MPI_T interface
to find out whether the fabric itself (the system) has a problem (and not
the MPI implementation on top of it)?

Cheers,

Denis
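
For reference, the MPI_T calls are part of the MPI standard and can be used
from any user program. A minimal sketch (not from this thread, and assuming
no Open MPI-specific names) that simply lists whatever performance variables
the library exposes:

/* mpit_list.c: list the performance variables ("pvars") an MPI library
 * exposes through the standard MPI_T tools interface.  Whether low-level
 * transport counters appear here depends on how the library was built. */
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    int provided, num, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num);

    for (i = 0; i < num; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        /* name_len / desc_len are in-out: pass buffer sizes, get lengths back */
        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &datatype, &enumtype, desc, &desc_len,
                            &bind, &readonly, &continuous, &atomic);
        printf("%4d  %s\n      %s\n", i, name, desc);
    }

    MPI_T_finalize();
    return 0;
}

Compile with mpicc and run on a single process; whether anything fabric- or
transport-related shows up in the list depends on the MPI installation.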




From: George Bosilca 
Sent: Saturday, February 12, 2022 7:38:02 AM
To: Bertini, Denis Dr.
Cc: Open MPI Users; Joseph Schuchart
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not sure I understand the comment about MPI_T.

Each network card has internal counters that can be gathered by any process on 
the node. Similarly, some information is available from the switches, but I 
always assumed that information is aggregated across all ongoing jobs. But, 
merging the switch-level information with the MPI level the necessary trend can 
be highlighted.

  George.


On Fri, Feb 11, 2022 at 12:43 PM Bertini, Denis Dr. 
mailto:d.bert...@gsi.de>> wrote:

May be i am wrong, but the MPI_T seems to aim to internal openMPI parameters 
right?


So with which kind of magic a tool like OSU INAM can get info from network 
fabric and even

switches related to a particular MPI job ...


There should be more info gathered in the background 



From: George Bosilca mailto:bosi...@icl.utk.edu>>
Sent: Friday, February 11, 2022 4:25:42 PM
To: Open MPI Users
Cc: Joseph Schuchart; Bertini, Denis Dr.
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Collecting data during execution is possible in OMPI either with an external 
tool, such as mpiP, or the internal infrastructure, SPC. Take a look at 
./examples/spc_example.c or ./test/spc/spc_test.c to see how to use this.

  George.


On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users 
mailto:users@lists.open-mpi.org>> wrote:

I have seen in OSU INAM paper:

"
While we chose MVAPICH2 for implementing our designs, any MPI
runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection 
and
transmission.
"

But i do not know what it is meant with "modified" openMPI ?


Cheers,

Denis



From: Joseph Schuchart mailto:schuch...@icl.utk.edu>>
Sent: Friday, February 11, 2022 3:02:36 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
with other MPI implementations? Would be worth investigating...

Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>
> Hi Joseph
>
> Looking at the MVAPICH i noticed that, in this MPI implementation
>
> a Infiniband Network Analysis  and Profiling Tool  is provided:
>
>
> OSU-INAM
>
>
> Is there something equivalent using openMPI ?
>
> Best
>
> Denis
>
>
> 
> *From:* users 
> mailto:users-boun...@lists.open-mpi.org>> 
> on behalf of Joseph
> Schuchart via users 
> mailto:users@lists.open-mpi.org>>
> *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> *To:* users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
> *Cc:* Joseph Schuchart
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> Hi Denis,
>
> Sorry if I missed it in your previous messages but could you also try
> running a different MPI implementation (MVAPICH) to see whether Open MPI
> is at fault or the system is somehow to blame for it?
>
> Thanks
> Joseph
>
> On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> >
> > Hi
> >
> > Thanks for all these informations !
> >
> >
> > But i have to confess that in this multi-tuning-parameter space,
> >
> > i got somehow lost.
> >
> > Furthermore it is somtimes mixing between user-space and kernel-space.
> >
> > I have only possibility to act on the user space.
> >
> >
> > 1) So i have on the system max locked memory:
> >
> > - ulimit -l unlimited (default )
> >
> >   and i do not see any warnings/errors related to that when
> launching MPI.
> >
> >
> > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> > drop in
> >
> > bw for size=16384
> >
> >
> > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
> >
> > the same behaviour.
> >
>

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread George Bosilca via users
I am not sure I understand the comment about MPI_T.

Each network card has internal counters that can be gathered by any process
on the node. Similarly, some information is available from the switches,
but I always assumed that information is aggregated across all ongoing
jobs. Still, by merging the switch-level information with the MPI-level
information, the necessary trend can be highlighted.

  George.
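
For what it is worth, on Linux the per-port HCA counters George mentions are
usually readable by any user through sysfs, for example (the device and port
names below are only examples and differ per system):

cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

Sampling them before and after a run gives a fabric-level view that is
independent of the MPI library.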


On Fri, Feb 11, 2022 at 12:43 PM Bertini, Denis Dr. 
wrote:

> May be i am wrong, but the MPI_T seems to aim to internal openMPI
> parameters right?
>
>
> So with which kind of magic a tool like OSU INAM can get info from network
> fabric and even
>
> switches related to a particular MPI job ...
>
>
> There should be more info gathered in the background 
>
>
> --
> *From:* George Bosilca 
> *Sent:* Friday, February 11, 2022 4:25:42 PM
> *To:* Open MPI Users
> *Cc:* Joseph Schuchart; Bertini, Denis Dr.
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
>
> Collecting data during execution is possible in OMPI either with an
> external tool, such as mpiP, or the internal infrastructure, SPC. Take a
> look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to use
> this.
>
>   George.
>
>
> On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users <
> users@lists.open-mpi.org> wrote:
>
>> I have seen in OSU INAM paper:
>>
>>
>> "
>> While we chose MVAPICH2 for implementing our designs, any MPI
>> runtime (e.g.: OpenMPI [12]) can be modified to perform similar data
>> collection and
>> transmission.
>> "
>>
>> But i do not know what it is meant with "modified" openMPI ?
>>
>>
>> Cheers,
>>
>> Denis
>>
>>
>> --
>> *From:* Joseph Schuchart 
>> *Sent:* Friday, February 11, 2022 3:02:36 PM
>> *To:* Bertini, Denis Dr.; Open MPI Users
>> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
>> network
>>
>> I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
>> with other MPI implementations? Would be worth investigating...
>>
>> Joseph
>>
>> On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>> >
>> > Hi Joseph
>> >
>> > Looking at the MVAPICH i noticed that, in this MPI implementation
>> >
>> > a Infiniband Network Analysis  and Profiling Tool  is provided:
>> >
>> >
>> > OSU-INAM
>> >
>> >
>> > Is there something equivalent using openMPI ?
>> >
>> > Best
>> >
>> > Denis
>> >
>> >
>> > 
>> > *From:* users  on behalf of Joseph
>> > Schuchart via users 
>> > *Sent:* Tuesday, February 8, 2022 4:02:53 PM
>> > *To:* users@lists.open-mpi.org
>> > *Cc:* Joseph Schuchart
>> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
>> > Infiniband network
>> > Hi Denis,
>> >
>> > Sorry if I missed it in your previous messages but could you also try
>> > running a different MPI implementation (MVAPICH) to see whether Open MPI
>> > is at fault or the system is somehow to blame for it?
>> >
>> > Thanks
>> > Joseph
>> >
>> > On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>> > >
>> > > Hi
>> > >
>> > > Thanks for all these informations !
>> > >
>> > >
>> > > But i have to confess that in this multi-tuning-parameter space,
>> > >
>> > > i got somehow lost.
>> > >
>> > > Furthermore it is somtimes mixing between user-space and kernel-space.
>> > >
>> > > I have only possibility to act on the user space.
>> > >
>> > >
>> > > 1) So i have on the system max locked memory:
>> > >
>> > > - ulimit -l unlimited (default )
>> > >
>> > >   and i do not see any warnings/errors related to that when
>> > launching MPI.
>> > >
>> > >
>> > > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
>> > > drop in
>> > >
>> > > bw for size=16384
>> > >
>> > >
>> > > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
>> > >
>> > > the same behaviour.
>> > >
>> > >
>

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread Bertini, Denis Dr. via users
Maybe I am wrong, but MPI_T seems to be aimed at internal Open MPI
parameters, right?

So by what kind of magic can a tool like OSU INAM get information from the
network fabric and even the switches related to a particular MPI job?

There must be more information gathered in the background ...



From: George Bosilca 
Sent: Friday, February 11, 2022 4:25:42 PM
To: Open MPI Users
Cc: Joseph Schuchart; Bertini, Denis Dr.
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Collecting data during execution is possible in OMPI either with an external 
tool, such as mpiP, or the internal infrastructure, SPC. Take a look at 
./examples/spc_example.c or ./test/spc/spc_test.c to see how to use this.

  George.


On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users 
mailto:users@lists.open-mpi.org>> wrote:

I have seen in OSU INAM paper:

"
While we chose MVAPICH2 for implementing our designs, any MPI
runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection 
and
transmission.
"

But i do not know what it is meant with "modified" openMPI ?


Cheers,

Denis



From: Joseph Schuchart mailto:schuch...@icl.utk.edu>>
Sent: Friday, February 11, 2022 3:02:36 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
with other MPI implementations? Would be worth investigating...

Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>
> Hi Joseph
>
> Looking at the MVAPICH i noticed that, in this MPI implementation
>
> a Infiniband Network Analysis  and Profiling Tool  is provided:
>
>
> OSU-INAM
>
>
> Is there something equivalent using openMPI ?
>
> Best
>
> Denis
>
>
> 
> *From:* users 
> mailto:users-boun...@lists.open-mpi.org>> 
> on behalf of Joseph
> Schuchart via users 
> mailto:users@lists.open-mpi.org>>
> *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> *To:* users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
> *Cc:* Joseph Schuchart
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> Hi Denis,
>
> Sorry if I missed it in your previous messages but could you also try
> running a different MPI implementation (MVAPICH) to see whether Open MPI
> is at fault or the system is somehow to blame for it?
>
> Thanks
> Joseph
>
> On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> >
> > Hi
> >
> > Thanks for all these informations !
> >
> >
> > But i have to confess that in this multi-tuning-parameter space,
> >
> > i got somehow lost.
> >
> > Furthermore it is somtimes mixing between user-space and kernel-space.
> >
> > I have only possibility to act on the user space.
> >
> >
> > 1) So i have on the system max locked memory:
> >
> > - ulimit -l unlimited (default )
> >
> >   and i do not see any warnings/errors related to that when
> launching MPI.
> >
> >
> > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> > drop in
> >
> > bw for size=16384
> >
> >
> > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
> >
> > the same behaviour.
> >
> >
> > 3) i realized that increasing the so-called warm up parameter  in the
> >
> > OSU benchmark (argument -x 200 as default) the discrepancy.
> >
> > At the contrary putting lower threshold ( -x 10 ) can increase this BW
> >
> > discrepancy up to factor 300 at message size 16384 compare to
> >
> > message size 8192 for example.
> >
> > So does it means that there are some caching effects
> >
> > in the internode communication?
> >
> >
> > From my experience, to tune parameters is a time-consuming and
> cumbersome
> >
> > task.
> >
> >
> > Could it also be the problem is not really on the openMPI
> > implemenation but on the
> >
> > system?
> >
> >
> > Best
> >
> > Denis
> >
> > 
> > *From:* users 
> > mailto:users-boun...@lists.open-mpi.org>> 
> > on behalf of Gus
> > Correa via users mailto:users@lists.open-mpi.org>>
> > *Sent:* Monday, February 7, 2022 9:14:19 PM
> > *To:* Open MPI Users
> > *Cc:* Gus Correa
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > Infiniban

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread George Bosilca via users
Collecting data during execution is possible in OMPI either with an external 
tool, such as mpiP, or with the internal infrastructure, SPC. Take a
look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to use
this.

  George.
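
A minimal sketch of reading one such counter through the standard MPI_T pvar
session API is given below. It is not taken from spc_example.c; the counter
name is a placeholder and must be replaced by a name actually reported on
your installation (with SPC enabled, Open MPI's software counters should be
among the listed pvars):

/* spc_read.c: read a single performance variable via an MPI_T pvar session. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* placeholder name -- replace with a real pvar name from your library;
     * the class below must also match what MPI_T_pvar_get_info reports */
    const char *pvar_name = "some_counter_name";
    int provided, pvar_index, count;
    unsigned long long value = 0;   /* assumes a 64-bit unsigned counter */
    MPI_T_pvar_session session;
    MPI_T_pvar_handle handle;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    if (MPI_T_pvar_get_index(pvar_name, MPI_T_PVAR_CLASS_COUNTER,
                             &pvar_index) == MPI_SUCCESS) {
        MPI_T_pvar_session_create(&session);
        MPI_T_pvar_handle_alloc(session, pvar_index, NULL, &handle, &count);

        /* ... the communication you want to measure goes here ...
         * (non-continuous variables would also need MPI_T_pvar_start) */

        MPI_T_pvar_read(session, handle, &value);
        printf("%s = %llu\n", pvar_name, value);

        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}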


On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users <
users@lists.open-mpi.org> wrote:

> I have seen in OSU INAM paper:
>
>
> "
> While we chose MVAPICH2 for implementing our designs, any MPI
> runtime (e.g.: OpenMPI [12]) can be modified to perform similar data
> collection and
> transmission.
> "
>
> But i do not know what it is meant with "modified" openMPI ?
>
>
> Cheers,
>
> Denis
>
>
> --
> *From:* Joseph Schuchart 
> *Sent:* Friday, February 11, 2022 3:02:36 PM
> *To:* Bertini, Denis Dr.; Open MPI Users
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
>
> I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
> with other MPI implementations? Would be worth investigating...
>
> Joseph
>
> On 2/11/22 06:54, Bertini, Denis Dr. wrote:
> >
> > Hi Joseph
> >
> > Looking at the MVAPICH i noticed that, in this MPI implementation
> >
> > a Infiniband Network Analysis  and Profiling Tool  is provided:
> >
> >
> > OSU-INAM
> >
> >
> > Is there something equivalent using openMPI ?
> >
> > Best
> >
> > Denis
> >
> >
> > ----------------
> > *From:* users  on behalf of Joseph
> > Schuchart via users 
> > *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> > *To:* users@lists.open-mpi.org
> > *Cc:* Joseph Schuchart
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > Infiniband network
> > Hi Denis,
> >
> > Sorry if I missed it in your previous messages but could you also try
> > running a different MPI implementation (MVAPICH) to see whether Open MPI
> > is at fault or the system is somehow to blame for it?
> >
> > Thanks
> > Joseph
> >
> > On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> > >
> > > Hi
> > >
> > > Thanks for all these informations !
> > >
> > >
> > > But i have to confess that in this multi-tuning-parameter space,
> > >
> > > i got somehow lost.
> > >
> > > Furthermore it is somtimes mixing between user-space and kernel-space.
> > >
> > > I have only possibility to act on the user space.
> > >
> > >
> > > 1) So i have on the system max locked memory:
> > >
> > > - ulimit -l unlimited (default )
> > >
> > >   and i do not see any warnings/errors related to that when
> > launching MPI.
> > >
> > >
> > > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> > > drop in
> > >
> > > bw for size=16384
> > >
> > >
> > > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
> > >
> > > the same behaviour.
> > >
> > >
> > > 3) i realized that increasing the so-called warm up parameter  in the
> > >
> > > OSU benchmark (argument -x 200 as default) the discrepancy.
> > >
> > > At the contrary putting lower threshold ( -x 10 ) can increase this BW
> > >
> > > discrepancy up to factor 300 at message size 16384 compare to
> > >
> > > message size 8192 for example.
> > >
> > > So does it means that there are some caching effects
> > >
> > > in the internode communication?
> > >
> > >
> > > From my experience, to tune parameters is a time-consuming and
> > cumbersome
> > >
> > > task.
> > >
> > >
> > > Could it also be the problem is not really on the openMPI
> > > implemenation but on the
> > >
> > > system?
> > >
> > >
> > > Best
> > >
> > > Denis
> > >
> > >
> 
> > > *From:* users  on behalf of Gus
> > > Correa via users 
> > > *Sent:* Monday, February 7, 2022 9:14:19 PM
> > > *To:* Open MPI Users
> > > *Cc:* Gus Correa
> > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > > Infiniband network
> > > This may have changed since, but these used to be relevant points.
> > > Overall, the Ope

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread Bertini, Denis Dr. via users
I have seen in the OSU INAM paper:

"
While we chose MVAPICH2 for implementing our designs, any MPI
runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection 
and
transmission.
"

But I do not know what is meant here by a "modified" Open MPI.


Cheers,

Denis



From: Joseph Schuchart 
Sent: Friday, February 11, 2022 3:02:36 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
with other MPI implementations? Would be worth investigating...

Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>
> Hi Joseph
>
> Looking at the MVAPICH i noticed that, in this MPI implementation
>
> a Infiniband Network Analysis  and Profiling Tool  is provided:
>
>
> OSU-INAM
>
>
> Is there something equivalent using openMPI ?
>
> Best
>
> Denis
>
>
> 
> *From:* users  on behalf of Joseph
> Schuchart via users 
> *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Joseph Schuchart
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> Hi Denis,
>
> Sorry if I missed it in your previous messages but could you also try
> running a different MPI implementation (MVAPICH) to see whether Open MPI
> is at fault or the system is somehow to blame for it?
>
> Thanks
> Joseph
>
> On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> >
> > Hi
> >
> > Thanks for all these informations !
> >
> >
> > But i have to confess that in this multi-tuning-parameter space,
> >
> > i got somehow lost.
> >
> > Furthermore it is somtimes mixing between user-space and kernel-space.
> >
> > I have only possibility to act on the user space.
> >
> >
> > 1) So i have on the system max locked memory:
> >
> > - ulimit -l unlimited (default )
> >
> >   and i do not see any warnings/errors related to that when
> launching MPI.
> >
> >
> > 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> > drop in
> >
> > bw for size=16384
> >
> >
> > 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
> >
> > the same behaviour.
> >
> >
> > 3) i realized that increasing the so-called warm up parameter  in the
> >
> > OSU benchmark (argument -x 200 as default) the discrepancy.
> >
> > At the contrary putting lower threshold ( -x 10 ) can increase this BW
> >
> > discrepancy up to factor 300 at message size 16384 compare to
> >
> > message size 8192 for example.
> >
> > So does it means that there are some caching effects
> >
> > in the internode communication?
> >
> >
> > From my experience, to tune parameters is a time-consuming and
> cumbersome
> >
> > task.
> >
> >
> > Could it also be the problem is not really on the openMPI
> > implemenation but on the
> >
> > system?
> >
> >
> > Best
> >
> > Denis
> >
> > 
> > *From:* users  on behalf of Gus
> > Correa via users 
> > *Sent:* Monday, February 7, 2022 9:14:19 PM
> > *To:* Open MPI Users
> > *Cc:* Gus Correa
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > Infiniband network
> > This may have changed since, but these used to be relevant points.
> > Overall, the Open MPI FAQ have lots of good suggestions:
> > https://www.open-mpi.org/faq/
> > some specific for performance tuning:
> > https://www.open-mpi.org/faq/?category=tuning
> > https://www.open-mpi.org/faq/?category=openfabrics
> >
> > 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> > available in compute nodes:
> > mpirun  --mca btl self,sm,openib  ...
> >
> > https://www.open-mpi.org/faq/?category=tuning#selecting-components
> >
> > However, this may have changed lately:
> > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> > 2) Maximum locked memory used by IB and their system limit. Start
> > here:
> >
> https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
> > 3) The eager vs. rendezvous message size threshold. I wonder if it may
> > sit right where you see the latency spike.
> > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> &

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread Joseph Schuchart via users
I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work 
with other MPI implementations? Would be worth investigating...


Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:


Hi Joseph

Looking at the MVAPICH i noticed that, in this MPI implementation

a Infiniband Network Analysis  and Profiling Tool  is provided:


OSU-INAM


Is there something equivalent using openMPI ?

Best

Denis



*From:* users  on behalf of Joseph 
Schuchart via users 

*Sent:* Tuesday, February 8, 2022 4:02:53 PM
*To:* users@lists.open-mpi.org
*Cc:* Joseph Schuchart
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking 
Infiniband network

Hi Denis,

Sorry if I missed it in your previous messages but could you also try
running a different MPI implementation (MVAPICH) to see whether Open MPI
is at fault or the system is somehow to blame for it?

Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>
> Hi
>
> Thanks for all these informations !
>
>
> But i have to confess that in this multi-tuning-parameter space,
>
> i got somehow lost.
>
> Furthermore it is somtimes mixing between user-space and kernel-space.
>
> I have only possibility to act on the user space.
>
>
> 1) So i have on the system max locked memory:
>
>                         - ulimit -l unlimited (default )
>
>   and i do not see any warnings/errors related to that when 
launching MPI.

>
>
> 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> drop in
>
> bw for size=16384
>
>
> 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
>
> the same behaviour.
>
>
> 3) i realized that increasing the so-called warm up parameter  in the
>
> OSU benchmark (argument -x 200 as default) the discrepancy.
>
> At the contrary putting lower threshold ( -x 10 ) can increase this BW
>
> discrepancy up to factor 300 at message size 16384 compare to
>
> message size 8192 for example.
>
> So does it means that there are some caching effects
>
> in the internode communication?
>
>
> From my experience, to tune parameters is a time-consuming and 
cumbersome

>
> task.
>
>
> Could it also be the problem is not really on the openMPI
> implemenation but on the
>
> system?
>
>
> Best
>
> Denis
>
> ------------
> *From:* users  on behalf of Gus
> Correa via users 
> *Sent:* Monday, February 7, 2022 9:14:19 PM
> *To:* Open MPI Users
> *Cc:* Gus Correa
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> This may have changed since, but these used to be relevant points.
> Overall, the Open MPI FAQ have lots of good suggestions:
> https://www.open-mpi.org/faq/
> some specific for performance tuning:
> https://www.open-mpi.org/faq/?category=tuning
> https://www.open-mpi.org/faq/?category=openfabrics
>
> 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> available in compute nodes:
> mpirun  --mca btl self,sm,openib  ...
>
> https://www.open-mpi.org/faq/?category=tuning#selecting-components
>
> However, this may have changed lately:
> https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> 2) Maximum locked memory used by IB and their system limit. Start
> here:
> 
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

> 3) The eager vs. rendezvous message size threshold. I wonder if it may
> sit right where you see the latency spike.
> https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> 4) Processor and memory locality/affinity and binding (please check
> the current options and syntax)
> https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>
> On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
>  wrote:
>
> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather
> benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is
> available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are
> indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: para

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread Bertini, Denis Dr. via users
Hi Joseph

Looking at MVAPICH, I noticed that this MPI implementation provides an
InfiniBand network analysis and profiling tool:

OSU-INAM

Is there something equivalent for Open MPI?

Best

Denis



From: users  on behalf of Joseph Schuchart 
via users 
Sent: Tuesday, February 8, 2022 4:02:53 PM
To: users@lists.open-mpi.org
Cc: Joseph Schuchart
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Hi Denis,

Sorry if I missed it in your previous messages but could you also try
running a different MPI implementation (MVAPICH) to see whether Open MPI
is at fault or the system is somehow to blame for it?

Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>
> Hi
>
> Thanks for all these informations !
>
>
> But i have to confess that in this multi-tuning-parameter space,
>
> i got somehow lost.
>
> Furthermore it is somtimes mixing between user-space and kernel-space.
>
> I have only possibility to act on the user space.
>
>
> 1) So i have on the system max locked memory:
>
> - ulimit -l unlimited (default )
>
>   and i do not see any warnings/errors related to that when launching MPI.
>
>
> 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> drop in
>
> bw for size=16384
>
>
> 4) I disable openIB ( no RDMA, ) and used only TCP,  and i noticed
>
> the same behaviour.
>
>
> 3) i realized that increasing the so-called warm up parameter  in the
>
> OSU benchmark (argument -x 200 as default) the discrepancy.
>
> At the contrary putting lower threshold ( -x 10 ) can increase this BW
>
> discrepancy up to factor 300 at message size 16384 compare to
>
> message size 8192 for example.
>
> So does it means that there are some caching effects
>
> in the internode communication?
>
>
> From my experience, to tune parameters is a time-consuming and cumbersome
>
> task.
>
>
> Could it also be the problem is not really on the openMPI
> implemenation but on the
>
> system?
>
>
> Best
>
> Denis
>
> ------------
> *From:* users  on behalf of Gus
> Correa via users 
> *Sent:* Monday, February 7, 2022 9:14:19 PM
> *To:* Open MPI Users
> *Cc:* Gus Correa
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> This may have changed since, but these used to be relevant points.
> Overall, the Open MPI FAQ have lots of good suggestions:
> https://www.open-mpi.org/faq/
> some specific for performance tuning:
> https://www.open-mpi.org/faq/?category=tuning
> https://www.open-mpi.org/faq/?category=openfabrics
>
> 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> available in compute nodes:
> mpirun  --mca btl self,sm,openib  ...
>
> https://www.open-mpi.org/faq/?category=tuning#selecting-components
>
> However, this may have changed lately:
> https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> 2) Maximum locked memory used by IB and their system limit. Start
> here:
> https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
> 3) The eager vs. rendezvous message size threshold. I wonder if it may
> sit right where you see the latency spike.
> https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> 4) Processor and memory locality/affinity and binding (please check
> the current options and syntax)
> https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>
> On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
>  wrote:
>
> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather
> benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is
> available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are
> indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
> value: "ignore", data source: default, level: 5 tuner/detail,
> type: int)
>Which allreduce algorithm is used. Can be
> locked down to any of: 0 ignore,

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Bertini, Denis Dr. via users
Hi


I do not have much experience with MVAPICH.

Since we work with Singularity containers, I can create a container,
install MVAPICH there, and compare.


Cheers,

Denis


From: users  on behalf of Joseph Schuchart 
via users 
Sent: Tuesday, February 8, 2022 4:02:53 PM
To: users@lists.open-mpi.org
Cc: Joseph Schuchart
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Hi Denis,

Sorry if I missed it in your previous messages but could you also try
running a different MPI implementation (MVAPICH) to see whether Open MPI
is at fault or the system is somehow to blame for it?

Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>
> Hi
>
> Thanks for all these informations !
>
>
> But i have to confess that in this multi-tuning-parameter space,
>
> i got somehow lost.
>
> Furthermore it is somtimes mixing between user-space and kernel-space.
>
> I have only possibility to act on the user space.
>
>
> 1) So i have on the system max locked memory:
>
> - ulimit -l unlimited (default )
>
>   and i do not see any warnings/errors related to that when launching MPI.
>
>
> 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> drop in
>
> bw for size=16384
>
>
> 4) I disable openIB ( no RDMA, ) and used only TCP,  and i noticed
>
> the same behaviour.
>
>
> 3) i realized that increasing the so-called warm up parameter  in the
>
> OSU benchmark (argument -x 200 as default) the discrepancy.
>
> At the contrary putting lower threshold ( -x 10 ) can increase this BW
>
> discrepancy up to factor 300 at message size 16384 compare to
>
> message size 8192 for example.
>
> So does it means that there are some caching effects
>
> in the internode communication?
>
>
> From my experience, to tune parameters is a time-consuming and cumbersome
>
> task.
>
>
> Could it also be the problem is not really on the openMPI
> implemenation but on the
>
> system?
>
>
> Best
>
> Denis
>
> ------------
> *From:* users  on behalf of Gus
> Correa via users 
> *Sent:* Monday, February 7, 2022 9:14:19 PM
> *To:* Open MPI Users
> *Cc:* Gus Correa
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> This may have changed since, but these used to be relevant points.
> Overall, the Open MPI FAQ have lots of good suggestions:
> https://www.open-mpi.org/faq/
> some specific for performance tuning:
> https://www.open-mpi.org/faq/?category=tuning
> https://www.open-mpi.org/faq/?category=openfabrics
>
> 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> available in compute nodes:
> mpirun  --mca btl self,sm,openib  ...
>
> https://www.open-mpi.org/faq/?category=tuning#selecting-components
>
> However, this may have changed lately:
> https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> 2) Maximum locked memory used by IB and their system limit. Start
> here:
> https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
> 3) The eager vs. rendezvous message size threshold. I wonder if it may
> sit right where you see the latency spike.
> https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> 4) Processor and memory locality/affinity and binding (please check
> the current options and syntax)
> https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>
> On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
>  wrote:
>
> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather
> benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is
> available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are
> indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
> value: "ignore", data source: default, level: 5 tuner/detail,
> type: int)
>Which allreduce algorithm is used. Can be
> locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Joseph Schuchart via users

Hi Denis,

Sorry if I missed it in your previous messages but could you also try 
running a different MPI implementation (MVAPICH) to see whether Open MPI 
is at fault or the system is somehow to blame for it?


Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:


Hi

Thanks for all these informations !


But i have to confess that in this multi-tuning-parameter space,

i got somehow lost.

Furthermore it is somtimes mixing between user-space and kernel-space.

I have only possibility to act on the user space.


1) So i have on the system max locked memory:

                        - ulimit -l unlimited (default )

  and i do not see any warnings/errors related to that when launching MPI.


2) I tried differents algorithms for MPI_all_reduce op.  all showing 
drop in


bw for size=16384


4) I disable openIB ( no RDMA, ) and used only TCP,  and i noticed

the same behaviour.


3) i realized that increasing the so-called warm up parameter  in the

OSU benchmark (argument -x 200 as default) the discrepancy.

At the contrary putting lower threshold ( -x 10 ) can increase this BW

discrepancy up to factor 300 at message size 16384 compare to

message size 8192 for example.

So does it means that there are some caching effects

in the internode communication?


From my experience, to tune parameters is a time-consuming and cumbersome

task.


Could it also be the problem is not really on the openMPI 
implemenation but on the


system?


Best

Denis


*From:* users  on behalf of Gus 
Correa via users 

*Sent:* Monday, February 7, 2022 9:14:19 PM
*To:* Open MPI Users
*Cc:* Gus Correa
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking 
Infiniband network

This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ have lots of good suggestions:
https://www.open-mpi.org/faq/
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using the Ethernet TCP/IP, which is widely 
available in compute nodes:

mpirun  --mca btl self,sm,openib  ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components

However, this may have changed lately: 
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
2) Maximum locked memory used by IB and their system limit. Start 
here: 
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
3) The eager vs. rendezvous message size threshold. I wonder if it may 
sit right where you see the latency spike.

https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
4) Processor and memory locality/affinity and binding (please check 
the current options and syntax)

https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4

On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users 
 wrote:


Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather
benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is
available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are
indicated.

If you need good performance, may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particular helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail,
type: int)
                           Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping
(tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
                           Valid values: 0:"ignore",
1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
           MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0",
data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be
helpful.

[1]

https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Benson Muite via users

Sorry, it is float, and the size reported by the benchmark is in bytes:
https://github.com/ROCm-Developer-Tools/OSU_Microbenchmarks/blob/master/mpi/collective/osu_allreduce.c

The L1 data cache is 32 KB per core:
https://www.cpu-world.com/CPUs/Zen/AMD-EPYC%207571.html
So a 16 KB send array plus a 16 KB receive array would exactly fill it, if
the cache is used in that manner.

Similar behavior is obtained on an Intel Xeon Platinum 8142M (also 32 KB L1
data cache per core) with Open MPI 4.0.2:

http://hidl.cse.ohio-state.edu/static/media/talks/slide/AWS_SC19_Talk_V2.pdf

On 2/8/22 2:17 PM, Bertini, Denis Dr. wrote:

Hi

Thanks a lot for all the infos !

Very interesting thanks !

We use basically AMP EPYC processor


 >>

vendor_id: AuthenticAMD
cpu family: 23
model: 1
model name: AMD EPYC 7551 32-Core Processor
stepping: 2
microcode: 0x8001250
cpu MHz: 2000.000
cache size: 512 KB
physical id: 1
siblings: 64
core id: 31
cpu cores: 32
apicid: 127
initial apicid: 127
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
 >>

The number of cores could depend on the node though ( 32/64 )
So according to your calculation, the message of 16384 bytes

should fit in?

BTW it is 16384 bytes or 16384 double precision = 16384*8bytes?

Best

Denis



*From:* users  on behalf of Benson 
Muite via users 

*Sent:* Tuesday, February 8, 2022 11:47:18 AM
*To:* users@lists.open-mpi.org
*Cc:* Benson Muite
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband 
network

On 2/8/22 11:06 AM, Bertini, Denis Dr. via users wrote:

Hi

Thanks for all these informations !


But i have to confess that in this multi-tuning-parameter space,

i got somehow lost.

Furthermore it is somtimes mixing between user-space and kernel-space.

I have only possibility to act on the user space.

Ok. If you are doing the test to check your system, you should only tune
for typical applications, rather than for one function call with a
specific message size. As you can change the OpenMPI default settings
for the algorithm used to do all reduce, you may wish to run tests and
choose a setting that will work well for most of your users. You may
also wish to upgrade to OpenMPI 4.1 as default, so perhaps do tests on
that version.



1) So i have on the system max locked memory:

                          - ulimit -l unlimited (default )

    and i do not see any warnings/errors related to that when launching MPI.


2) I tried differents algorithms for MPI_all_reduce op.  all showing drop in

bw for size=16384

The drops are of different magnitude depending on the algorithm used,
default gives worst case latency of 54970.96 us and round robin gives
worst case latency of 4992.04 us for a size of 16384. May be helpful to
indicate what hardware you are using, both for the chip (cache sizes
will be important) and the interconnect. Perhaps try the test on 2 or 4
nodes as well.



4) I disable openIB ( no RDMA, ) and used only TCP,  and i noticed

the same behaviour.

This suggests it is the chip, rather than the interconnect.



3) i realized that increasing the so-called warm up parameter  in the

OSU benchmark (argument -x 200 as default) the discrepancy.

At the contrary putting lower threshold ( -x 10 ) can increase this BW

discrepancy up to factor 300 at message size 16384 compare to

message size 8192 for example.

So does it means that there are some caching effects

in the internode communication?



Probably. If you are using AMD 7551P nodes, these have 96K L1 cache per
core. A message of 16384 double precision uses 132K so will not fit in
L1 cache, and a message of 8192 uses 66K and will fit in L1 cache.
Perhaps try the same test on Intel Xeon e52680 nodes or 6248r nodes.

Some relevant studies are:
Zhong, Cao, Bosilica and Dongarra, "Using long vector extensions for MPI
reductions", https://doi.org/10.1016/j.parco.2021.102871 
<https://doi.org/10.1016/j.parco.2021.102871>


Hashmi, Chakraborty, Bayatpour, Subramoni and Panda "Designing Shared
Address Space MPI libraries in the Many-core Era",
https://jahanzeb-hashmi.github.io/files/talks/ipdps18.pdf 
<https://jahanzeb-hashmi.github.io/files/talks/ipdps18.pdf>


Saini, Mehrotra, Taylor, Shende and Biswas, "Performance Analysis of
Scientific and Engineering Applications Using MPInside and TAU",
https://ntrs.nasa.gov/api/citations/20100038444/downloads/20100038444.pdf 
<https://ntrs.nasa.gov/api/citations/20100038444/downloads/20100038444.pdf>

The second study by Hashmi et al. focuses on inter node communication,
but has a nice performance model that demonstrates understanding of the
communication pattern. For typical use of MPI on a particular cluster,
such a detailed understanding is likely not necessary. These studies do
also collect hardware performance information.


  From my experience, to tune parameters is a time-consuming and cumbersome

task.


Could it al

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Bertini, Denis Dr. via users
Hi

Thanks a lot for all the info, very interesting!

We basically use AMD EPYC processors:


>>

vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7551 32-Core Processor
stepping : 2
microcode : 0x8001250
cpu MHz : 2000.000
cache size : 512 KB
physical id : 1
siblings : 64
core id : 31
cpu cores : 32
apicid : 127
initial apicid : 127
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
>>

The number of cores may depend on the node, though (32/64).

So according to your calculation, a message of 16384 bytes should fit in?

By the way, is the size 16384 bytes, or 16384 double-precision values,
i.e. 16384 * 8 bytes?

Best

Denis



From: users  on behalf of Benson Muite via 
users 
Sent: Tuesday, February 8, 2022 11:47:18 AM
To: users@lists.open-mpi.org
Cc: Benson Muite
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

On 2/8/22 11:06 AM, Bertini, Denis Dr. via users wrote:
> Hi
>
> Thanks for all these informations !
>
>
> But i have to confess that in this multi-tuning-parameter space,
>
> i got somehow lost.
>
> Furthermore it is somtimes mixing between user-space and kernel-space.
>
> I have only possibility to act on the user space.
Ok. If you are doing the test to check your system, you should only tune
for typical applications, rather than for one function call with a
specific message size. As you can change the OpenMPI default settings
for the algorithm used to do all reduce, you may wish to run tests and
choose a setting that will work well for most of your users. You may
also wish to upgrade to OpenMPI 4.1 as default, so perhaps do tests on
that version.
>
>
> 1) So i have on the system max locked memory:
>
>  - ulimit -l unlimited (default )
>
>and i do not see any warnings/errors related to that when launching MPI.
>
>
> 2) I tried differents algorithms for MPI_all_reduce op.  all showing drop in
>
> bw for size=16384
The drops are of different magnitude depending on the algorithm used,
default gives worst case latency of 54970.96 us and round robin gives
worst case latency of 4992.04 us for a size of 16384. May be helpful to
indicate what hardware you are using, both for the chip (cache sizes
will be important) and the interconnect. Perhaps try the test on 2 or 4
nodes as well.
>
>
> 4) I disable openIB ( no RDMA, ) and used only TCP,  and i noticed
>
> the same behaviour.
This suggests it is the chip, rather than the interconnect.
>
>
> 3) i realized that increasing the so-called warm up parameter  in the
>
> OSU benchmark (argument -x 200 as default) the discrepancy.
>
> At the contrary putting lower threshold ( -x 10 ) can increase this BW
>
> discrepancy up to factor 300 at message size 16384 compare to
>
> message size 8192 for example.
>
> So does it means that there are some caching effects
>
> in the internode communication?
>
>
Probably. If you are using AMD 7551P nodes, these have 96K L1 cache per
core. A message of 16384 double precision uses 132K so will not fit in
L1 cache, and a message of 8192 uses 66K and will fit in L1 cache.
Perhaps try the same test on Intel Xeon e52680 nodes or 6248r nodes.

Some relevant studies are:
Zhong, Cao, Bosilica and Dongarra, "Using long vector extensions for MPI
reductions", https://doi.org/10.1016/j.parco.2021.102871

Hashmi, Chakraborty, Bayatpour, Subramoni and Panda "Designing Shared
Address Space MPI libraries in the Many-core Era",
https://jahanzeb-hashmi.github.io/files/talks/ipdps18.pdf

Saini, Mehrotra, Taylor, Shende and Biswas, "Performance Analysis of
Scientific and Engineering Applications Using MPInside and TAU",
https://ntrs.nasa.gov/api/citations/20100038444/downloads/20100038444.pdf

The second study by Hashmi et al. focuses on inter node communication,
but has a nice performance model that demonstrates understanding of the
communication pattern. For typical use of MPI on a particular cluster,
such a detailed understanding is likely not necessary. These studies do
also collect hardware performance information.

>  From my experience, to tune parameters is a time-consuming and cumbersome
>
> task.
>
>
> Could it also be the problem is not really on the openMPI implemenation
> but on the
>
> system?
The default OpenMPI parameters may need to be adjusted for a good user
experience on your system, but demanding users will probably do this for
their specific applications. By changing the algorithm used for all
reduce, you got a factor of 10 improvement in the benchmark for a size
of 16384.  Perhaps determine which MPI calls are used most often on your
cluster, and provide a guide as to how OpenMPI can be tuned for these.
Alternatively, if you have a set of heavily used applications, profile
them to determine most used MPI calls and

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Benson Muite via users

On 2/8/22 11:06 AM, Bertini, Denis Dr. via users wrote:

Hi

Thanks for all these informations !


But i have to confess that in this multi-tuning-parameter space,

i got somehow lost.

Furthermore it is somtimes mixing between user-space and kernel-space.

I have only possibility to act on the user space.
Ok. If you are doing the test to check your system, you should only tune 
for typical applications, rather than for one function call with a 
specific message size. As you can change the OpenMPI default settings 
for the algorithm used to do all reduce, you may wish to run tests and 
choose a setting that will work well for most of your users. You may 
also wish to upgrade to OpenMPI 4.1 as default, so perhaps do tests on 
that version.



1) So i have on the system max locked memory:

                         - ulimit -l unlimited (default )

   and i do not see any warnings/errors related to that when launching MPI.


2) I tried differents algorithms for MPI_all_reduce op.  all showing drop in

bw for size=16384
The drops are of different magnitude depending on the algorithm used, 
default gives worst case latency of 54970.96 us and round robin gives 
worst case latency of 4992.04 us for a size of 16384. May be helpful to 
indicate what hardware you are using, both for the chip (cache sizes 
will be important) and the interconnect. Perhaps try the test on 2 or 4 
nodes as well.



4) I disable openIB ( no RDMA, ) and used only TCP,  and i noticed

the same behaviour.

This suggests it is the chip, rather than the interconnect.



3) i realized that increasing the so-called warm up parameter  in the

OSU benchmark (argument -x 200 as default) the discrepancy.

At the contrary putting lower threshold ( -x 10 ) can increase this BW

discrepancy up to factor 300 at message size 16384 compare to

message size 8192 for example.

So does it means that there are some caching effects

in the internode communication?


Probably. If you are using AMD 7551P nodes, these have 96K L1 cache per 
core. A message of 16384 double precision uses 132K so will not fit in 
L1 cache, and a message of 8192 uses 66K and will fit in L1 cache. 
Perhaps try the same test on Intel Xeon e52680 nodes or 6248r nodes.


Some relevant studies are:
Zhong, Cao, Bosilca and Dongarra, "Using long vector extensions for MPI 
reductions", https://doi.org/10.1016/j.parco.2021.102871


Hashmi, Chakraborty, Bayatpour, Subramoni and Panda "Designing Shared 
Address Space MPI libraries in the Many-core Era", 
https://jahanzeb-hashmi.github.io/files/talks/ipdps18.pdf


Saini, Mehrotra, Taylor, Shende and Biswas, "Performance Analysis of 
Scientific and Engineering Applications Using MPInside and TAU", 
https://ntrs.nasa.gov/api/citations/20100038444/downloads/20100038444.pdf


The second study by Hashmi et al. focuses on inter node communication, 
but has a nice performance model that demonstrates understanding of the 
communication pattern. For typical use of MPI on a particular cluster, 
such a detailed understanding is likely not necessary. These studies do 
also collect hardware performance information.



 From my experience, to tune parameters is a time-consuming and cumbersome

task.


Could it also be the problem is not really on the openMPI implemenation 
but on the


system?
The default OpenMPI parameters may need to be adjusted for a good user 
experience on your system, but demanding users will probably do this for 
their specific applications. By changing the algorithm used for all 
reduce, you got a factor of 10 improvement in the benchmark for a size 
of 16384.  Perhaps determine which MPI calls are used most often on your 
cluster, and provide a guide as to how OpenMPI can be tuned for these. 
Alternatively, if you have a set of heavily used applications, profile 
them to determine most used MPI calls and then set defaults that would 
improve application performance.
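
As a concrete illustration of such a guide (a sketch only: the parameter
names come from the ompi_info output quoted earlier in this thread, the
algorithm number 4 / "ring" is just an example value, and on many Open MPI
versions the forced selection only takes effect together with
coll_tuned_use_dynamic_rules):

mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 4 \
       -np 64 ./osu_allreduce

ompi_info --all | grep coll_tuned_allreduce shows the values that are valid
on a given installation.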


Do also check whether there are any performance measurements available 
from your infiniband switch provider that will allow checking of correct 
functionality at the single switch level.



Best

Denis


*From:* users  on behalf of Gus Correa 
via users 

*Sent:* Monday, February 7, 2022 9:14:19 PM
*To:* Open MPI Users
*Cc:* Gus Correa
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband 
network

This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ have lots of good suggestions:
https://www.open-mpi.org/faq/ <https://www.open-mpi.org/faq/>
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning 
<https://www.open-mpi.org/faq/?category=tuning>
https://www.open-mpi.org/faq/?category=openfabrics 
<https://www.open-mpi.org/faq/?category=openfabrics>


1) Make sure you are not using the Ethernet TCP/IP, which is widely 
available in compute nodes:


mpirun 

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Bertini, Denis Dr. via users
Hi

Thanks for all this information!

But I have to confess that I got somewhat lost in this
multi-tuning-parameter space.

Furthermore, it sometimes mixes user space and kernel space, and I can
only act on the user space.

1) So on the system the max locked memory is:

- ulimit -l unlimited (default)

  and I do not see any warnings/errors related to that when launching MPI.

2) I tried different algorithms for the MPI_Allreduce op.; all show the
drop in bandwidth at size=16384.

4) I disabled openib (no RDMA) and used only TCP, and I noticed the same
behaviour.

3) I realized that increasing the so-called warm-up parameter in the OSU
benchmark (argument -x, 200 as default) reduces the discrepancy. On the
contrary, a lower value (-x 10) can increase this bandwidth discrepancy up
to a factor of 300 at message size 16384 compared to message size 8192,
for example (see the example commands below). So does that mean that there
are some caching effects in the inter-node communication?

From my experience, tuning parameters is a time-consuming and cumbersome
task.

Could it also be that the problem is not really in the Open MPI
implementation but in the system?


Best

Denis
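
For reference, the warm-up comparison described in point 3 above can be
reproduced with something like the following (process count and path are
placeholders; in the OSU collective benchmarks -x sets the warm-up
iterations and -i the measured iterations):

mpirun -np 64 ./osu_allreduce -x 200 -i 1000
mpirun -np 64 ./osu_allreduce -x 10  -i 1000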


From: users  on behalf of Gus Correa via 
users 
Sent: Monday, February 7, 2022 9:14:19 PM
To: Open MPI Users
Cc: Gus Correa
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ have lots of good suggestions:
https://www.open-mpi.org/faq/
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using the Ethernet TCP/IP, which is widely available 
in compute nodes:

mpirun --mca btl self,sm,openib ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components


However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable

2) Maximum locked memory used by IB and their system limit. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message size threshold.
I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user

4) Processor and memory locality/affinity and binding (please check the current 
options and syntax)
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4

On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users 
mailto:users@lists.open-mpi.org>> wrote:
Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, you may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
   Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
   Valid values: 0:"ignore", 1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
   MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1]
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
> 
> *From:* users 
> mailto:users-boun...@lists.open-mpi.org>> 
> on behalf of Benson
> Muite via users mailto:users@lists.open-mpi.org>>
> *Sent:* Monday, February 7, 2022 2:27:34 PM
> *To:* users@lists.open-mpi.org<mailto:users@lists.open-mpi.or

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Gus Correa via users
This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ has lots of good suggestions:
https://www.open-mpi.org/faq/
some specifically for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using Ethernet TCP/IP, which is widely
available on compute nodes:

mpirun --mca btl self,sm,openib ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components

However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
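
As an illustration only, and depending on the Open MPI release and fabric stack in use,
the transport could be pinned explicitly; the vader BTL replaced sm in later releases,
and recent releases over InfiniBand usually go through the UCX PML:

# older releases: select the openib BTL explicitly
mpirun --mca btl self,vader,openib ./osu_allreduce
# or simply exclude the TCP BTL
mpirun --mca btl ^tcp ./osu_allreduce
# recent releases with UCX installed
mpirun --mca pml ucx ./osu_allreduce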

2) Maximum locked memory used by IB and the system limit on it. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message size threshold.
I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
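
A sketch of how the eager limit could be inspected and, purely as an experiment, raised
past the 16 KB message size; the parameter names below belong to the openib BTL and
would differ for other transports such as UCX:

# show the eager-related limits of the openib BTL
ompi_info --param btl openib --level 9 | grep eager
# rerun the benchmark with a larger eager limit, for comparison only
mpirun --mca btl_openib_eager_limit 65536 ./osu_allreduce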

4) Processor and memory locality/affinity and binding (please check
the current options and syntax)
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
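
For example, assuming a reasonably recent mpirun, mapping and binding can be made
explicit and verified per rank:

mpirun --map-by node --bind-to core --report-bindings ./osu_allreduce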


On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <
users@lists.open-mpi.org> wrote:

> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
> value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>Which allreduce algorithm is used. Can be
> locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
> reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
>Valid values: 0:"ignore", 1:"basic_linear",
> 2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
> 5:"segmented_ring", 6:"rabenseifner"
>MCA coll tuned: parameter
> "coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
> source: default, level: 5 tuner/detail, type: int)
>
> For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.
>
> [1]
>
> https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
> [2] https://github.com/open-mpi/ompi-collectives-tuning
>
> On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> > Hi
> >
> > When i repeat i always got the huge discrepancy at the
> >
> > message size of 16384.
> >
> > May be there is a way to run mpi in verbose mode in order
> >
> > to further investigate this behaviour?
> >
> > Best
> >
> > Denis
> >
> > ----
> > *From:* users  on behalf of Benson
> > Muite via users 
> > *Sent:* Monday, February 7, 2022 2:27:34 PM
> > *To:* users@lists.open-mpi.org
> > *Cc:* Benson Muite
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> > network
> > Hi,
> > Do you get similar results when you repeat the test? Another job could
> > have interfered with your run.
> > Benson
> > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> >> Hi
> >>
> >> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
> >> check/benchmark
> >>
> >> the infiniband network for our cluster.
> >>
> >> For that i use the collective all_reduce benchmark and run over 200
> >> nodes, using 1 process per node.
> >>
> >> And this is the results i obtained 
> >>
> >>
> >>
> >> 
> >>
> >> # OSU MPI Allreduce Latency Test v5.7.1
> >> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
> >> 4                     114.65             83.22            147.98        1000
> >> 8                     133.85            106.47            164.93        1000
> >> 16                    116.41             87.57

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi

I changed the algorithm to the ring algorithm (4, for example) and the

scan changed to:


# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                      59.39             51.04             65.36           1
8                     109.13             90.14            126.32           1
16                    253.26             60.89            290.31           1
32                     75.04             54.53             83.28           1
64                     96.40             59.73            111.45           1
128                    67.86             59.73             76.44           1
256                    76.32             67.33             85.18           1
512                   129.93             85.76            170.31           1
1024                  168.51            129.15            194.68           1
2048                  136.17            110.09            156.94           1
4096                  173.59            130.76            199.21           1
8192                  236.05            170.77            269.98           1
16384                4212.65           3627.71           4992.04           1
32768                1243.05           1205.11           1276.11           1
65536                1464.50           1364.76           1531.48           1
131072               1558.71           1454.52           1632.91           1
262144               1681.58           1609.15           1745.44           1
524288               2305.73           2178.17           2402.69           1
1048576              3389.83           3220.44           3517.61           1

Would this mean that the first result was linked to the underlying algorithm
used by default

in Open MPI (0=ignore)?

Do you know which algorithm this default (0=ignore) corresponds to?

I still see the wall at message size 16384, though ...
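
For reference, a sketch of how a specific tuned-collective algorithm can be forced from
the command line, assuming the coll/tuned component and the dynamic-rules switch
mentioned elsewhere in this thread:

# force the ring algorithm (4) for MPI_Allreduce
mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_allreduce_algorithm 4 ./osu_allreduce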

Best

Denis







From: Benson Muite 
Sent: Monday, February 7, 2022 4:59:45 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, you may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
   Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
   Valid values: 0:"ignore", 1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
   MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1]
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
> 
> *From:* users  on behalf of Benson
> Muite via users 
> *Sent:* Monday, February 7, 2022 2:27:34 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Benson Muite
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
> Hi,
> Do you get similar results when you repeat the test? Another job could
> have interfered with your run.
> Benson
> On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> Hi
>>
>> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
>> check/benchmark
>>
>> the infiniband network for our cluster.
>>
>> For that i use the collective all_reduce benchmark and run over 200
>> nodes, using 1 process per node.
>&

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi,
I ran the allgather benchmark and got these values,
which also show a step-wise performance drop as a function
of message size.
Would this be linked to the underlying algorithm used for the collective operation?


# OSU MPI Allgather Latency Test v5.7.1
# Size       Avg Latency(us)
1                      70.36
2                      47.01
4                      72.42
8                      49.62
16                     57.93
32                     50.11
64                     57.29
128                    74.05
256                   454.41
512                   544.04
1024                  580.96
2048                  711.40
4096                  905.14
8192                 2002.32
16384                2652.59
32768                4034.35
65536                6816.29
131072              14280.11
262144              28451.46
524288              54719.41
1048576            106607.19



I use srun and not mpirun; how do I activate the verbosity flag in that case?
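
One common approach, assuming Open MPI's environment-variable form of MCA parameters
(OMPI_MCA_), which srun passes through to the ranks, would be something like:

export OMPI_MCA_coll_base_verbose=100
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allreduce_algorithm=4
srun -N 200 --ntasks-per-node=1 ./osu_allreduce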


Best

Denis



From: Benson Muite 
Sent: Monday, February 7, 2022 4:59:45 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, you may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
   Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
   Valid values: 0:"ignore", 1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
   MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1]
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
> 
> *From:* users  on behalf of Benson
> Muite via users 
> *Sent:* Monday, February 7, 2022 2:27:34 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Benson Muite
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
> Hi,
> Do you get similar results when you repeat the test? Another job could
> have interfered with your run.
> Benson
> On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> Hi
>>
>> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
>> check/benchmark
>>
>> the infiniband network for our cluster.
>>
>> For that i use the collective all_reduce benchmark and run over 200
>> nodes, using 1 process per node.
>>
>> And this is the results i obtained 
>>
>>
>>
>> 
>>
>> # OSU MPI Allreduce Latency Test v5.7.1
>> # Size   Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
>> 4                     114.65             83.22            147.98        1000
>> 8                     133.85            106.47            164.93        1000
>> 16                    116.41             87.57            150.58        1000
>> 32                    112.17             93.25            130.23        1000
>> 64                    106.85             81.93            134.74        1000
>> 128                   117.53             87.50            152.27        1000
>> 256                   143.08            115.63            173.97        1000
>> 512                   130.34

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Benson Muite via users

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect 
performance. Can you give specifications similar to what is available at:

http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, you may want to also specify the algorithm
used. You can find some of the parameters you can tune using:


ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current 
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
  Which allreduce algorithm is used. Can be 
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned 
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
  Valid values: 0:"ignore", 1:"basic_linear", 
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring", 
5:"segmented_ring", 6:"rabenseifner"
  MCA coll tuned: parameter 
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data 
source: default, level: 5 tuner/detail, type: int)
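
As a convenience, the listing can be narrowed to the tuned collective component instead
of scanning the full --all output, e.g.:

ompi_info --param coll tuned --level 9 | grep allreduce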


For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1] 
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi

[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:

Hi

When I repeat, I always get the huge discrepancy at the

message size of 16384.

Maybe there is a way to run MPI in verbose mode in order

to further investigate this behaviour?

Best

Denis


*From:* users  on behalf of Benson 
Muite via users 

*Sent:* Monday, February 7, 2022 2:27:34 PM
*To:* users@lists.open-mpi.org
*Cc:* Benson Muite
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband 
network

Hi,
Do you get similar results when you repeat the test? Another job could
have interfered with your run.
Benson
On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:

Hi

I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to 
check/benchmark


the infiniband network for our cluster.

For that I use the collective all_reduce benchmark and run over 200
nodes, using 1 process per node.


And these are the results I obtained:





# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     114.65             83.22            147.98        1000
8                     133.85            106.47            164.93        1000
16                    116.41             87.57            150.58        1000
32                    112.17             93.25            130.23        1000
64                    106.85             81.93            134.74        1000
128                   117.53             87.50            152.27        1000
256                   143.08            115.63            173.97        1000
512                   130.34            100.20            167.56        1000
1024                  155.67            111.29            188.20        1000
2048                  151.82            116.03            198.19        1000
4096                  159.11            122.09            199.24        1000
8192                  176.74            143.54            221.98        1000
16384               48862.85          39270.21          54970.96        1000
32768                2737.37           2614.60           2802.68        1000
65536                2723.15           2585.62           2813.65        1000



Could someone explain to me what is happening at message size 16384?
One can notice a huge latency (~300 times larger) compared to message
size 8192.
I do not really understand what could create such an increase in the
latency.
The reason I use the OSU microbenchmarks is that we
sporadically experience a drop
in the bandwidth for typical collective operations such as MPI_Reduce in
our cluster,

which is difficult to understand.
I would be grateful if somebody could share their expertise on such a problem
with me.


Best,
Denis



-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. P

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi

When I repeat, I always get the huge discrepancy at the

message size of 16384.

Maybe there is a way to run MPI in verbose mode in order

to further investigate this behaviour?

Best

Denis


From: users  on behalf of Benson Muite via 
users 
Sent: Monday, February 7, 2022 2:27:34 PM
To: users@lists.open-mpi.org
Cc: Benson Muite
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Hi,
Do you get similar results when you repeat the test? Another job could
have interfered with your run.
Benson
On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> Hi
>
> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
> check/benchmark
>
> the infiniband network for our cluster.
>
> For that i use the collective all_reduce benchmark and run over 200
> nodes, using 1 process per node.
>
> And this is the results i obtained 
>
>
>
> 
>
> # OSU MPI Allreduce Latency Test v5.7.1
> # Size   Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
> 4                     114.65             83.22            147.98        1000
> 8                     133.85            106.47            164.93        1000
> 16                    116.41             87.57            150.58        1000
> 32                    112.17             93.25            130.23        1000
> 64                    106.85             81.93            134.74        1000
> 128                   117.53             87.50            152.27        1000
> 256                   143.08            115.63            173.97        1000
> 512                   130.34            100.20            167.56        1000
> 1024                  155.67            111.29            188.20        1000
> 2048                  151.82            116.03            198.19        1000
> 4096                  159.11            122.09            199.24        1000
> 8192                  176.74            143.54            221.98        1000
> 16384               48862.85          39270.21          54970.96        1000
> 32768                2737.37           2614.60           2802.68        1000
> 65536                2723.15           2585.62           2813.65        1000
>
> 
>
> Could someone explain me what is happening for message = 16384 ?
> One can notice a huge latency (~ 300 time larger)  compare to message
> size = 8192.
> I do not really understand what could  create such an increase in the
> latency.
> The reason i use the OSU microbenchmarks is that we
> sporadically experience a drop
> in the bandwith for typical collective operations such as MPI_Reduce in
> our cluster
> which is difficult to understand.
> I would be grateful if somebody can share its expertise or such problem
> with me.
>
> Best,
> Denis
>
>
>
> -
> Denis Bertini
> Abteilung: CIT
> Ort: SB3 2.265a
>
> Tel: +49 6159 71 2240
> Fax: +49 6159 71 2986
> E-Mail: d.bert...@gsi.de
>
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>
> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> Managing Directors / Geschäftsführung:
> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
> Ministerialdirigent Dr. Volkmar Dietz
>



Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Benson Muite via users

Hi,
Do you get similar results when you repeat the test? Another job could 
have interfered with your run.

Benson
On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:

Hi

I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to 
check/benchmark


the infiniband network for our cluster.

For that I use the collective all_reduce benchmark and run over 200
nodes, using 1 process per node.


And these are the results I obtained:





# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     114.65             83.22            147.98        1000
8                     133.85            106.47            164.93        1000
16                    116.41             87.57            150.58        1000
32                    112.17             93.25            130.23        1000
64                    106.85             81.93            134.74        1000
128                   117.53             87.50            152.27        1000
256                   143.08            115.63            173.97        1000
512                   130.34            100.20            167.56        1000
1024                  155.67            111.29            188.20        1000
2048                  151.82            116.03            198.19        1000
4096                  159.11            122.09            199.24        1000
8192                  176.74            143.54            221.98        1000
16384               48862.85          39270.21          54970.96        1000
32768                2737.37           2614.60           2802.68        1000
65536                2723.15           2585.62           2813.65        1000



Could someone explain to me what is happening at message size 16384?
One can notice a huge latency (~300 times larger) compared to message
size 8192.
I do not really understand what could create such an increase in the
latency.
The reason I use the OSU microbenchmarks is that we
sporadically experience a drop
in the bandwidth for typical collective operations such as MPI_Reduce in
our cluster,

which is difficult to understand.
I would be grateful if somebody could share their expertise on such a problem
with me.


Best,
Denis



-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz





[OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi

I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to 
check/benchmark

the infiniband network for our cluster.

For that I use the collective all_reduce benchmark and run over 200 nodes,
using 1 process per node.

And these are the results I obtained:





# OSU MPI Allreduce Latency Test v5.7.1
# Size   Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     114.65             83.22            147.98        1000
8                     133.85            106.47            164.93        1000
16                    116.41             87.57            150.58        1000
32                    112.17             93.25            130.23        1000
64                    106.85             81.93            134.74        1000
128                   117.53             87.50            152.27        1000
256                   143.08            115.63            173.97        1000
512                   130.34            100.20            167.56        1000
1024                  155.67            111.29            188.20        1000
2048                  151.82            116.03            198.19        1000
4096                  159.11            122.09            199.24        1000
8192                  176.74            143.54            221.98        1000
16384               48862.85          39270.21          54970.96        1000
32768                2737.37           2614.60           2802.68        1000
65536                2723.15           2585.62           2813.65        1000



Could someone explain to me what is happening at message size 16384?
One can notice a huge latency (~300 times larger) compared to message size
8192.
I do not really understand what could create such an increase in the latency.
The reason I use the OSU microbenchmarks is that we sporadically experience a
drop
in the bandwidth for typical collective operations such as MPI_Reduce in our
cluster, which is difficult to understand.
I would be grateful if somebody could share their expertise on such a problem with me.

Best,
Denis



-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz