Re: [OMPI users] Severe performance issue with PSM2 and single-node CP2K jobs

2017-02-21 Thread Iliev, Hristo
Hi Jingchao,

 

My bad, I should have read your thread more closely. The problem is indeed that
CP2K calls MPI_Alloc_mem to allocate memory for practically everything, and it
does so all the time. This somehow managed to escape our earlier profiling runs,
perhaps because we were too focused on finding a communication issue. Profiling
the program again with a different tool showed 70% of the run time being spent
in memory allocation. Disabling the openib BTL prevents the memory registration
and solves the issue. It looks like we will be disabling the openib BTL on the
entire Omni-Path partition.
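
For reference, the problematic pattern looks roughly like the sketch below. This
is not CP2K code, just a minimal illustration; the 1 MiB buffer size and the
iteration count are arbitrary. With an RDMA-capable BTL such as openib loaded,
MPI_Alloc_mem may register (pin) the allocated pages so that they can later be
used for RDMA, which is far more expensive than a plain malloc; with openib
disabled the call degenerates into an ordinary allocation. Running the sketch
with and without "--mca btl ^openib" should make the difference visible.

/* alloc_hotpath.c - minimal sketch (not CP2K code) of the MPI_Alloc_mem
 * usage pattern described above.
 * Build: mpicc alloc_hotpath.c -o alloc_hotpath
 * Run:   mpiexec -n 2 ./alloc_hotpath
 *        mpiexec -n 2 --mca btl ^openib ./alloc_hotpath
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int iters = 1000;
    const MPI_Aint bytes = 1 << 20;   /* 1 MiB per allocation */
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; ++i) {
        void *buf = NULL;
        /* With an RDMA-capable BTL (e.g. openib) loaded, this call may pin
         * and register the pages; done this often, registration dominates
         * the run time. */
        MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf);
        memset(buf, 0, (size_t)bytes);   /* touch the memory */
        MPI_Free_mem(buf);
    }

    double t1 = MPI_Wtime();
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("%d alloc/free cycles of %ld bytes took %.3f s\n",
               iters, (long)bytes, t1 - t0);

    MPI_Finalize();
    return 0;
}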

 

Regards,

Hristo

 


Re: [OMPI users] Severe performance issue with PSM2 and single-node CP2K jobs

2017-02-08 Thread Cabral, Matias A
Hi Hristo,

As you mention, I have seen that the sm BTL shows better performance for smaller
messages than the PSM2 shm device does, from running some OSU benchmarks
(especially bandwidth for messages < 256 bytes). I even suspect that the
difference would be more noticeable if you tested the vader BTL. However, the
piece I still need to look at is what CP2K in particular is doing to make this
difference so bad.

As of today, unfortunately OMPI will not allow you to use the sm BTL and the
psm2 MTL simultaneously. Additionally, forcing the use of the Omni-Path verbs
API (openib BTL) will run much slower (verbs is not intended to be the API for
message passing on Omni-Path).
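
For anyone who wants to reproduce the small-message comparison without the full
OSU suite, here is a minimal ping-pong latency sketch; the 256-byte message size
and the iteration counts are arbitrary choices, not the OSU defaults. Running it
once with the defaults (cm/psm2 is selected on Omni-Path nodes) and once with
"--mca pml ob1" to force ob1 with the sm or vader BTL should expose the gap
discussed above.

/* pingpong.c - minimal two-process ping-pong latency sketch (not the OSU code).
 * Build: mpicc pingpong.c -o pingpong
 * Run:   mpiexec -n 2 --bind-to core ./pingpong                (default PML/MTL)
 *        mpiexec -n 2 --bind-to core --mca pml ob1 ./pingpong  (ob1 + sm/vader)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 256;            /* small-message regime discussed above */
    const int warmup = 1000, iters = 100000;

    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *buf = calloc(msg_bytes, 1);
    int peer = 1 - rank;
    double t0 = 0.0;

    for (int i = 0; i < warmup + iters; ++i) {
        if (i == warmup) {                /* start timing after the warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d-byte one-way latency: %.3f us\n",
               msg_bytes, (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}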

As mentioned above, I'll be looking at CP2K.

Thanks, 

_MAC




Re: [OMPI users] Severe performance issue with PSM2 and single-node CP2K jobs

2017-02-08 Thread Jingchao Zhang
Hi Hristo,


We have a similar problem here and I started a thread a few days ago. 
https://mail-archive.com/users@lists.open-mpi.org/msg30581.html


Regards,

Jingchao



[OMPI users] Severe performance issue with PSM2 and single-node CP2K jobs

2017-02-08 Thread Iliev, Hristo
Hi,

While trying to debug a severe performance regression of CP2K runs with Open
MPI 1.10.4 on our new cluster, and after reproducing the problem with
single-node jobs too, we found that the root cause is the presence of Intel
Omni-Path hardware: it triggers the use of the cm PML and consequently of the
psm2 MTL for shared-memory communication instead of the sm BTL. As subsequent
tests with NetPIPE on a single socket showed (see the attached graph), the
ping-pong latency of PSM2's shared-memory implementation is always
significantly higher (by 20-60%), except for a relatively narrow range of
message lengths, 10-100 KiB, for which it is faster. Tests with processes on
two sockets show that sm outperforms psm2 at smaller message sizes and psm2
outperforms sm at larger ones, at least for messages of less than 32 MiB. The
real problem, though, is that the ScaLAPACK routines used by CP2K further
amplify the difference, which results in orders of magnitude slower execution.
We've tested with both MKL and with ScaLAPACK (and even BLAS) from Netlib in
order to exclude possible performance regressions in MKL when used with Open
MPI, which is our default configuration.

While disabling the psm2 MTL or enforcing the ob1 PML is a viable workaround 
for single-node jobs, it is not really a solution to our problem in general as 
utilising Omni-Path via its InfiniBand interface results in high latency and 
poor network bandwidth. As expected, disabling the "shm" device of PSM2 
crashes the program.
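
If changing the launch command is inconvenient, the same single-node workaround
can, as far as I can tell, also be applied from inside the application, since
Open MPI picks up MCA parameters from OMPI_MCA_* environment variables. The
sketch below sets the variable before MPI_Init; it is only a hedged alternative
to the usual route of passing "--mca pml ob1" (or "--mca mtl ^psm2") to
mpiexec, and it is only sensible for single-node jobs, as forcing ob1 also
takes internode traffic off the Omni-Path fast path.

/* force_ob1.c - sketch: select the ob1 PML (and thus the sm/vader BTL for
 * intranode traffic) from inside the program instead of on the mpiexec
 * command line.  Open MPI reads MCA parameters from OMPI_MCA_* environment
 * variables, so setting one before MPI_Init should have the same effect as
 * "mpiexec --mca pml ob1".
 * Build: mpicc force_ob1.c -o force_ob1
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Equivalent of "--mca pml ob1".  Alternatively, OMPI_MCA_mtl could be
     * set to "^psm2" to keep the cm PML but exclude the psm2 MTL. */
    setenv("OMPI_MCA_pml", "ob1", 1);

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI initialised with OMPI_MCA_pml=%s\n",
               getenv("OMPI_MCA_pml"));

    MPI_Finalize();
    return 0;
}

The --mca flags on the mpiexec command line remain the safe route; I have not
tested the environment-variable approach beyond the sketch above.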

My question is actually whether it is currently possible for several PMLs to
coexist and be used at the same time. Ideally, ob1 would drive the sm BTL for
intranode communication and cm would drive the psm2 MTL for internode
communication. From my limited understanding of the Open MPI source code, that
doesn't really seem possible. While the psm2 MTL appears to be a relatively
thin wrapper around the PSM2 API, and therefore the problem might not really be
in Open MPI but in the PSM2 library itself, it somehow does not affect Intel
MPI. It also seems to be a CP2K-specific problem, as different software
(Quantum ESPRESSO built with ScaLAPACK) runs fine, but then that could be due
to different ScaLAPACK routines being used.

The attached graphs show the ratio of the MPI ping-pong latency as measured by 
NetPIPE when run as follows:

mpiexec -n 2 --map-by core/socket --bind-to core NPmpi -a -I -l 1 -u 33554432

with and without --mca pml ob1. I also performed tests with Linux CMA support 
in PSM2 switched on and off (it is on by default), which doesn't change much. 
Our default Open MPI is built without CMA support.

Has anyone successfully run ScaLAPACK applications, and CP2K in particular, on 
systems with Intel Omni-Path? Perhaps I'm missing something here?

I'm sorry if this has already been discussed here. I went through the list 
archives, but couldn't find anything. If it was, I would be grateful if anyone 
could provide pointers to the relevant thread(s).

Kind regards,
Hristo
--
Hristo Iliev, PhD
JARA-HPC CSG "Parallel Efficiency"

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
52074 Aachen, Germany
Tel: +49 (241) 80-24367
Fax: +49 (241) 80-624367
il...@itc.rwth-aachen.de
http://www.itc.rwth-aachen.de



