Michael,

In general terms, and assuming you are running all message sizes in PIO Eager Mode, the communication is going to be affected by the CPU load. In other words, the bigger the message, the more CPU cycles it takes to copy the buffer. Additionally, I have to say I am not very certain how MPI_Send() will behave under the hood with temporary buffering; I think more predictable behavior would be seen with MPI_Ssend(). Now, if you really don't want the sender to be affected by the receiver's load, you need to move to non-blocking calls such as MPI_Isend().
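A minimal sketch of that last suggestion (this is not the hello_world.c from the attachment; the rank layout, message size, and timing are only illustrative): the sender posts the send with MPI_Isend(), can keep working, and only waits for completion afterwards.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int count = 160000;              /* message size in bytes, illustrative only */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(count);

        if (rank == 0) {                       /* sender (r0) */
            MPI_Request req;
            double t0 = MPI_Wtime();
            memset(buf, 1, count);
            /* MPI_Isend() returns immediately; the sender is not held up
               waiting for the receiver to post its MPI_Recv(). */
            MPI_Isend(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            /* ... do useful work here instead of blocking ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* buf may be reused only after this */
            printf("sender time: %f s\n", MPI_Wtime() - t0);
        } else if (rank == 1) {                /* receiver (r1) */
            MPI_Recv(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Note that MPI_Wait() can still block for messages above the rendezvous threshold until the receiver posts the matching receive, so the overlap helps most in the eager range.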
_MAC

From: Xiaolong Cui
Sent: Thursday, August 11, 2016 2:13 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] runtime performance tuning for Intel OMA interconnect

Sorry, forgot the attachments.

On Thu, Aug 11, 2016 at 5:06 PM, Xiaolong Cui <sunshine...@gmail.com> wrote:

Thanks! I tried it, but it didn't solve my problem; maybe the reason is not eager/rndv. The reason I want to always use eager mode is that I want to avoid a sender being slowed down by an unready receiver. I can prevent a sender from slowing down by always using eager mode on InfiniBand, just like your approach, but I cannot reproduce this on OPA. Based on the experiments below, it seems to me that a sender is delayed to some extent for reasons other than eager/rndv.

I designed a simple test (see hello_world.c in the attachment) with one sender rank (r0) and one receiver rank (r1). r0 always runs at full speed, while r1 runs at full speed in one case and at half speed in the other. To run r1 at half speed, I co-locate a third rank r2 with r1 (see rankfile). I then compare the completion time at r0 to see whether there is a slowdown when r1 is "unready to receive". The result is positive. But it is surprising that the delay varies significantly when I change the message length, which is different from my previous observations where eager/rndv was the cause. So my question is: do you know of other factors that cause a delay to an MPI_Send() when the receiver is not ready to receive?

On Wed, Aug 10, 2016 at 11:48 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:

To remain in eager mode you need to increase the size of PSM2_MQ_RNDV_HFI_THRESH. PSM2_MQ_EAGER_SDMA_SZ is the threshold at which PSM2 changes from PIO (which uses the CPU) to the SDMA engines. This summary may help:

  PIO Eager Mode:  0 bytes -> PSM2_MQ_EAGER_SDMA_SZ - 1
  SDMA Eager Mode: PSM2_MQ_EAGER_SDMA_SZ -> PSM2_MQ_RNDV_HFI_THRESH - 1
  RNDV Expected:   PSM2_MQ_RNDV_HFI_THRESH -> largest supported value

Regards,
_MAC

From: Xiaolong Cui
Sent: Wednesday, August 10, 2016 7:19 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] runtime performance tuning for Intel OMA interconnect

Hi Matias,

Thanks a lot, that's very helpful! What I need indeed is to always use eager mode. But I didn't find any information about PSM2_MQ_EAGER_SDMA_SZ online. Would you please elaborate on "Just in case PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA, always in eager mode."?

Thanks!
Michael

On Wed, Aug 10, 2016 at 3:59 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:

Hi Michael,

When Open MPI runs on Omni-Path, it chooses the PSM2 MTL by default, which uses libpsm2.so. Strictly speaking, it can also run over the openib BTL, but the performance is so significantly impacted that doing so is not only discouraged: no tuning would make sense. Regarding the PSM2 MTL, it currently supports only two MCA parameters ("mtl_psm2_connect_timeout" and "mtl_psm2_priority"), which are not what you are looking for. Instead, you can set values directly in the PSM2 library with environment variables.
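For example (a hypothetical launch line, not from the thread), one way to set a PSM2 variable for every rank is mpirun's -x option; the threshold value and program name here are only illustrative:

    mpirun -np 2 -x PSM2_MQ_RNDV_HFI_THRESH=200000 ./hello_world

Per the threshold summary above, any value larger than your biggest message keeps all sends on the eager path.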
Further info is in the Programmer's Guide: http://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v3_0.pdf
More docs: https://www-ssl.intel.com/content/www/us/en/support/network-and-i-o/fabric-products/000016242.html?wapkw=psm2

Now, for your parameters:

  btl = openib,vader,self -> ignore this one
  btl_openib_eager_limit = 160000 -> I don't clearly see the difference from the parameter below; however, they are set to the same value. Just in case PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA, always in eager mode.
  btl_openib_rndv_eager_limit = 160000 -> PSM2_MQ_RNDV_HFI_THRESH
  btl_openib_max_send_size = 160000 -> does not apply to PSM2
  btl_openib_receive_queues = P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,160000,1024,512,512 -> does not apply to PSM2

Thanks,
Regards,
_MAC

BTW, you should change the subject: OMA -> OPA

From: Xiaolong Cui
Sent: Tuesday, August 09, 2016 2:22 PM
To: users@lists.open-mpi.org
Subject: [OMPI users] runtime performance tuning for Intel OMA interconnect

I used to tune the performance of Open MPI on InfiniBand by changing the MCA parameters of the openib component (see https://www.open-mpi.org/faq/?category=openfabrics). Now I have migrated to a new cluster that deploys Intel's Omni-Path interconnect, and my previous approach no longer works. Does anyone know how to tune performance for the Omni-Path interconnect (i.e., which Open MPI component to change)? The version of Open MPI is openmpi-1.10.2-hfi. I have included the output from ompi_info and the openib parameters that I used to change. Thanks!

Sincerely,
Michael

--
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260