Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Mathieu Gontier

Dear all,

Thanks a lot for your support. I only had time to run one test today, but the
option --mca mpi_leave_pinned 0 works in my case. I will go further next week,
but for the moment I can submit my computation.


Thanks a lot for your help.

--
Mathieu Gontier
skype: mathieu_gontier


Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Samuel K. Gutierrez
Hi,

QP = Queue Pair

Here are a couple of nice FAQ entries that I find useful.

http://www.open-mpi.org/faq/?category=openfabrics

And videos:

http://www.open-mpi.org/video/?category=openfabrics


--
Samuel K. Gutierrez
Los Alamos National Laboratory



Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Samuel K. Gutierrez
Hi,

What happens when you don't run with per-peer queue pairs?  Try:

-mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,128

--
Samuel K. Gutierrez
Los Alamos National Laboratory
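To make the format of that option concrete, here is my reading of the spec string (a sketch, not an authoritative description): each colon-separated entry describes one receive queue as type,buffer_size,buffer_count, where 'S' requests a shared receive queue (as opposed to 'P', per-peer), so receive-buffer memory no longer grows with the number of peers.

```shell
# Decompose the suggested spec into its per-queue entries:
# type,buffer_size_bytes,buffer_count -- three shared queues of
# increasing buffer size.
spec="S,4096,128:S,12288,128:S,65536,128"
echo "$spec" | tr ':' '\n'
```

Running this prints one line per queue (S,4096,128 / S,12288,128 / S,65536,128), which makes it easier to see what you are asking the openib BTL to allocate.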







Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Mathieu Gontier

Hi,

Thanks for your answer. It makes sense.
Sorry if my question seems silly, but what does QP mean? It is difficult 
to read the FAQ without knowing that!


Thanks.


--
Mathieu Gontier
skype: mathieu_gontier


Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Ralph Castain
One possibility: if you increase the number of processes in the job, and they 
all interconnect, then the IB interface can (I believe) run out of memory at 
some point. IIRC, the answer was to reduce the size of the QPs so that you 
could support a larger number of them.

You should find info about controlling QP size in the IB FAQ area on the OMPI 
web site, I believe.
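Ralph's point can be made concrete with back-of-the-envelope arithmetic (all numbers below are illustrative assumptions, not Smoky's actual settings): with per-peer queue pairs, the receive buffers posted for every connected peer add up quickly as the job grows.

```shell
# Illustrative only: receive-buffer memory if every peer gets its own QP.
# All three values are assumed for the sake of the estimate.
peers=512        # fully connected remote processes
depth=256        # receive buffers posted per peer QP
bufsize=4096     # bytes per receive buffer
echo "$(( peers * depth * bufsize / 1024 / 1024 )) MiB"   # prints "512 MiB"
```

Half a gigabyte of pinned receive buffers per rank, before any other registered memory, is the kind of growth that can make qp_create_one fail with "Cannot allocate memory".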



Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Mathieu Gontier

Hello,

Thanks for the answer.
I am testing with OpenMPI-1.4.3: my computation is queuing, but I did not
read anything obvious related to my issue. Have you read something which
could solve it?
I am going to submit my computation with --mca mpi_leave_pinned 0, but do
you have any idea how it affects performance, compared to using Ethernet?


Many thanks for your support.

--
Mathieu Gontier
skype: mathieu_gontier


Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Josh Hursey
I wonder if this is related to memory pinning. Can you try turning off
the leave pinned, and see if the problem persists (this may affect
performance, but should avoid the crash):
  mpirun ... --mca mpi_leave_pinned 0 ...

Also it looks like Smoky has a slightly newer version of the 1.4
branch that you should try to switch to if you can. The following
command will show you all of the available installs on that machine:
  shell$ module avail ompi

For a list of supported compilers for that version try the 'show' option:
shell$ module show ompi/1.4.3
---
/sw/smoky/modulefiles-centos/ompi/1.4.3:

module-whatis    This module configures your environment to make Open
MPI 1.4.3 available.
Supported Compilers:
 pathscale/3.2.99
 pathscale/3.2
 pgi/10.9
 pgi/10.4
 intel/11.1.072
 gcc/4.4.4
 gcc/4.4.3
---
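Putting the two suggestions together, a session on Smoky might look like the sketch below (the module name comes from the listing above; './solver' and any batch-scheduler wrapper are placeholders, not the actual job setup):

```shell
# Switch to the newer 1.4-branch build shown in the module listing.
module load ompi/1.4.3

# Re-run with leave-pinned disabled to see whether the qp_create_one
# crash goes away; './solver' stands in for the real solver binary.
mpirun --mca mpi_leave_pinned 0 ./solver
```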

Let me know if that helps.

Josh


On Wed, Jun 22, 2011 at 4:16 AM, Mathieu Gontier
 wrote:
> Dear all,
>
> First of all, my apologies for posting this message to both the bug
> and user mailing lists, but for the moment I do not know whether it is a bug!
>
> I am running a CFD structured flow solver at ORNL, and I have access to a
> small cluster (Smoky) using OpenMPI-1.4.2 with InfiniBand by default.
> Recently we increased the size of our models, and since then we have
> run into many InfiniBand-related problems. The most serious is a
> hard crash with the following error message:
>
> [smoky45][[60998,1],32][/sw/sources/ompi/1.4.2/ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> error creating qp errno says Cannot allocate memory
>
> If we force the solver to use Ethernet (mpirun -mca btl ^openib), the
> computations work correctly, although very slowly (a single iteration
> takes ages). Do you have any idea what could be causing these problems?
>
> If it is due to a bug or a limitation in Open MPI, do you think version
> 1.4.3, the upcoming 1.4.4, or any 1.5 version could solve the problem? I read
> the release notes but did not see any obvious patch that could fix my
> problem. The system administrator is ready to compile a new package for us,
> but I do not want to ask him to install too many of them.
>
> Thanks.
> --
>
> Mathieu Gontier
> skype: mathieu_gontier
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey