Michael,

In general terms, and assuming you are running all message sizes in PIO Eager Mode, the communication is going to be affected by the CPU load. In other words, the bigger the message, the more CPU cycles it takes to copy the buffer. Additionally, I have to say I am not very certain how MPI_Send() will behave under the hood with temporary buffering; I think more predictable behavior would be seen with MPI_Ssend(). Now, if you really don't want the sender to be affected by the receiver's load, you need to move to non-blocking calls such as MPI_Isend().
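A minimal sketch of that last suggestion (this is not the hello_world.c from the attachment; the rank layout, message size, and timing are only illustrative): the sender posts the send with MPI_Isend(), can keep working, and only waits for completion afterwards.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int count = 160000;              /* message size in bytes, illustrative only */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(count);

        if (rank == 0) {                       /* sender (r0) */
            MPI_Request req;
            double t0 = MPI_Wtime();
            memset(buf, 1, count);
            /* MPI_Isend() returns immediately; the sender is not held up
               waiting for the receiver to post its MPI_Recv(). */
            MPI_Isend(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            /* ... do useful work here instead of blocking ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* buf may be reused only after this */
            printf("sender time: %f s\n", MPI_Wtime() - t0);
        } else if (rank == 1) {                /* receiver (r1) */
            MPI_Recv(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Note that MPI_Wait() can still block for messages above the rendezvous threshold until the receiver posts the matching receive, so the overlap helps most in the eager range.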
_MAC

From: Xiaolong Cui
Sent: Thursday, August 11, 2016 2:13 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] runtime performance tuning for Intel OMA interconnect

Sorry, forgot the attachments.

On Thu, Aug 11, 2016 at 5:06 PM, Xiaolong Cui <sunshine...@gmail.com> wrote:

Thanks! I tried it, but it didn't solve my problem; maybe the reason is not eager/rndv. The reason I want to always use eager mode is that I want to avoid a sender being slowed down by an unready receiver. I can prevent a sender from slowing down by always using eager mode on InfiniBand, just like your approach, but I cannot reproduce this on OPA. Based on the experiments below, it seems to me that a sender is delayed to some extent for reasons other than eager/rndv.

I designed a simple test (see hello_world.c in the attachment) with one sender rank (r0) and one receiver rank (r1). r0 always runs at full speed, while r1 runs at full speed in one case and at half speed in the other. To run r1 at half speed, I co-locate a third rank r2 with r1 (see rankfile). I then compare the completion time at r0 to see whether there is a slowdown when r1 is "unready to receive". The result is positive. But it is surprising that the delay varies significantly when I change the message length, which is different from my previous observations where eager/rndv was the cause. So my question is: do you know of other factors that cause a delay to an MPI_Send() when the receiver is not ready to receive?

On Wed, Aug 10, 2016 at 11:48 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:

To remain in eager mode you need to increase the size of PSM2_MQ_RNDV_HFI_THRESH. PSM2_MQ_EAGER_SDMA_SZ is the threshold at which PSM2 changes from PIO (which uses the CPU) to the SDMA engines. This summary may help:

  PIO Eager Mode:  0 bytes -> PSM2_MQ_EAGER_SDMA_SZ - 1
  SDMA Eager Mode: PSM2_MQ_EAGER_SDMA_SZ -> PSM2_MQ_RNDV_HFI_THRESH - 1
  RNDV Expected:   PSM2_MQ_RNDV_HFI_THRESH -> largest supported value

Regards,
_MAC

From: Xiaolong Cui
Sent: Wednesday, August 10, 2016 7:19 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] runtime performance tuning for Intel OMA interconnect

Hi Matias,

Thanks a lot, that's very helpful! What I need indeed is to always use eager mode. But I didn't find any information about PSM2_MQ_EAGER_SDMA_SZ online. Would you please elaborate on "Just in case PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA, always in eager mode."?

Thanks!
Michael

On Wed, Aug 10, 2016 at 3:59 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:

Hi Michael,

When Open MPI runs on Omni-Path, it chooses the PSM2 MTL by default, which uses libpsm2.so. Strictly speaking, it can also run over the openib BTL, but the performance is so significantly impacted that doing so is not only discouraged: no tuning would make sense. Regarding the PSM2 MTL, it currently supports only two MCA parameters ("mtl_psm2_connect_timeout" and "mtl_psm2_priority"), which are not what you are looking for. Instead, you can set values directly in the PSM2 library with environment variables.
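For example (a hypothetical launch line, not from the thread), one way to set a PSM2 variable for every rank is mpirun's -x option; the threshold value and program name here are only illustrative:

    mpirun -np 2 -x PSM2_MQ_RNDV_HFI_THRESH=200000 ./hello_world

Per the threshold summary above, any value larger than your biggest message keeps all sends on the eager path.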
Further info is in the Programmer's Guide: http://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v3_0.pdf
More docs: https://www-ssl.intel.com/content/www/us/en/support/network-and-i-o/fabric-products/000016242.html?wapkw=psm2

Now, for your parameters:

  btl = openib,vader,self -> ignore this one
  btl_openib_eager_limit = 160000 -> I don't clearly see the difference from the parameter below; however, they are set to the same value. Just in case PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA, always in eager mode.
  btl_openib_rndv_eager_limit = 160000 -> PSM2_MQ_RNDV_HFI_THRESH
  btl_openib_max_send_size = 160000 -> does not apply to PSM2
  btl_openib_receive_queues = P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,160000,1024,512,512 -> does not apply to PSM2

Thanks,
Regards,
_MAC

BTW, you should change the subject: OMA -> OPA

From: Xiaolong Cui
Sent: Tuesday, August 09, 2016 2:22 PM
To: users@lists.open-mpi.org
Subject: [OMPI users] runtime performance tuning for Intel OMA interconnect

I used to tune the performance of Open MPI on InfiniBand by changing the MCA parameters of the openib component (see https://www.open-mpi.org/faq/?category=openfabrics). Now I have migrated to a new cluster that deploys Intel's Omni-Path interconnect, and my previous approach no longer works. Does anyone know how to tune performance for the Omni-Path interconnect (i.e., which Open MPI component to change)? The version of Open MPI is openmpi-1.10.2-hfi. I have included the output from ompi_info and the openib parameters that I used to change. Thanks!

Sincerely,
Michael

--
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260