Re: [OMPI users] Bottleneck of OpenMPI over 100Gbps ROCE
Hi Joshua,

Thank you very much for your help. I am replying so late because I wanted to confirm the suggested MXM solution first. Unfortunately, I have not been able to get MXM working yet (the two servers cannot communicate over MXM). I will report back on whether MXM solves the problem once I have sorted that out. For now, I think your explanation is right, because ib_read_bw also hits a similar bandwidth limit.

Thanks,
Zhaogeng
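For reference, MXM is typically selected in Open MPI 1.10 with command lines roughly like the following. The host names and the mlx5_0 device name are placeholders (a ConnectX-4 adapter usually appears as an mlx5 device), not values taken from this thread:

    # Select MXM via the cm PML and mxm MTL
    mpirun -np 2 --map-by node -host node1,node2 \
        -mca pml cm -mca mtl mxm \
        -x MXM_RDMA_PORTS=mlx5_0:1 ./osu_bw

    # Or, if this build includes the MXM-based yalla PML
    mpirun -np 2 --map-by node -host node1,node2 \
        -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1 ./osu_bw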
[OMPI users] Bottleneck of OpenMPI over 100Gbps ROCE
Hi all,

Sorry for resubmitting this question; I forgot to add a subject to the previous email.

I ran into a problem while testing the performance of Open MPI over 100Gbps RoCE. I have two servers connected with Mellanox 100Gbps ConnectX-4 RoCE NICs. I used the Intel MPI Benchmarks to test Open MPI (1.10.3) over RDMA and found that the PingPong bandwidth (2 ranks, one rank per server) only reaches about 6 GB/s with the openib BTL. With the OSU MPI benchmarks the bandwidth only reaches about 6.5 GB/s. However, when I start two benchmark runs at the same time (two ranks per server), the total bandwidth reaches about 11 GB/s.

It looks like the CPU is the bottleneck, but the bottleneck is clearly not memcpy, and RDMA itself should not consume much CPU, since perftest's ib_write_bw easily reaches 11 GB/s. Is this bandwidth limit normal? Does anyone know what the real bottleneck is?

Thanks in advance for your kind help.

Regards,
Zhaogeng
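The setup described above can be reproduced with invocations roughly like the following (one rank per node, openib BTL forced). The host names node1/node2, the device name mlx5_0, and the message size are placeholders for this kind of setup, not values from the original post:

    # IMB PingPong, one rank per node, over the openib BTL
    mpirun -np 2 --map-by node -host node1,node2 \
        -mca btl openib,self,sm \
        -mca btl_openib_if_include mlx5_0 ./IMB-MPI1 PingPong

    # OSU point-to-point bandwidth test, same placement
    mpirun -np 2 --map-by node -host node1,node2 \
        -mca btl openib,self,sm ./osu_bw

    # Raw RDMA write bandwidth with perftest, for comparison
    ib_write_bw -d mlx5_0 -s 1048576          # on node1 (server side)
    ib_write_bw -d mlx5_0 -s 1048576 node1    # on node2 (client side)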