Re: [OMPI users] Increasing OFED registerable memory

2015-01-08 Thread Waleed Lotfy
You are right.

I didn't realize that SGE applies limits other than those in 
'/etc/security/limits.conf', even though you explained it :/

The fix was to add 'H_MEMORYLOCKED=unlimited' to the execd_params.

Thank you all for your time and efforts and keep up the great work :)

Waleed Lotfy
Bibliotheca Alexandrina

From: users [users-boun...@open-mpi.org] on behalf of Gus Correa 
[g...@ldeo.columbia.edu]
Sent: Tuesday, January 06, 2015 9:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Increasing OFED registerable memory

Hi Waleed

As Devendar said (and I tried to explain before),
you need to allow the locked memory limit to be unlimited for
user processes (in /etc/security/limits.conf),
*AND* the daemon/job script/whatever launches the mpiexec
command must request "ulimit -l unlimited" (directly or indirectly).
The latter part depends on the details of your system.
I am not familiar with SGE (I use Torque), but presumably you can
add "ulimit -l unlimited" when you launch
the SGE daemons on the nodes.
Presumably this will make the processes launched by that daemon
(i.e. your mpiexec) inherit those limits,
and that is how I do it on Torque.
A more brute-force way is just to include "ulimit -l unlimited"
in your job script before mpiexec.
Inserting "ulimit -a" in your job script may help diagnose what you
actually have.
Please see the OMPI FAQ that I sent you before for more details.
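
As a rough illustration of the diagnostic step above, the limit that a
process actually inherits can also be read from inside the job. The sketch
below is a minimal Python example, assuming a Linux node with Python 3
available; the function name describe_memlock_limit is made up for this
illustration and is not part of Open MPI, SGE, or Torque.

```python
import resource


def describe_memlock_limit():
    """Return the soft and hard locked-memory limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY is what "ulimit -l" prints as "unlimited".
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        # getrlimit reports bytes; "ulimit -l" reports units of 1024 bytes.
        return f"{value // 1024} KiB"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_memlock_limit()
    print(f"memlock soft={soft} hard={hard}")
```

Running this from the job script just before mpiexec, and again from an
interactive login on the same node, should show whether the scheduler's
daemon is handing down a different limit than limits.conf does.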

I hope this helps,
Gus Correa

On 01/06/2015 01:37 PM, Deva wrote:
> Hi Waleed,
>
>   Memlock limit: 65536
>
> Such a low limit is probably due to the per-user locked-memory limit. Can you
> make sure it is set to "unlimited" on all nodes ("ulimit -l unlimited")?
>
> -Devendar

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/01/26111.php


Re: [OMPI users] Increasing OFED registerable memory

2015-01-06 Thread Waleed Lotfy
Hi guys,

Sorry for getting back so late. We ran into some problems during the 
installation process. As soon as the system came up I tested the new 
versions, but they showed another memory-related warning.

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    comp003.local
  OMPI source:   btl_openib_component.c:1200
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   comp003.local
  Local device: mlx4_0
--------------------------------------------------------------------------
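
For comparing the number in this report with what "ulimit -l" prints, a
small Python sketch follows. It assumes the "Memlock limit" value is given
in bytes (as the underlying rlimit is), while bash's "ulimit -l" reports
units of 1024 bytes; the helper name memlock_limit_kib is hypothetical.

```python
import re


def memlock_limit_kib(message):
    """Extract 'Memlock limit: N' (assumed to be bytes) and return it in KiB.

    Returns None when the message contains no numeric memlock limit.
    """
    match = re.search(r"Memlock limit:\s*(\d+)", message)
    if match is None:
        return None
    return int(match.group(1)) // 1024


report = """
  Device:        mlx4_0
  Memlock limit: 65536
"""
print(memlock_limit_kib(report))  # 64
```

Under that assumption, the 65536 reported above corresponds to "ulimit -l"
showing 64, which is far below the "unlimited" setting the message asks for.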

<<>>

My current running versions:

OpenMPI: 1.6.4
OFED-internal-2.3-2

I checked /etc/security/limits.d/ and the scheduler's configuration (Grid 
Engine), and tried adding the following line to /etc/modprobe.d/mlx4_core: 
'options mlx4_core log_num_mtt=22 log_mtts_per_seg=1', as suggested by Gus.

I am running out of ideas here, so any help is appreciated.

P.S. I am not sure if I should open a new thread for this issue or continue 
with the current one, so please advise.

Waleed Lotfy
Bibliotheca Alexandrina


Re: [OMPI users] Increasing OFED registerable memory

2014-12-31 Thread Waleed Lotfy
Thanks Gus for your help.

We have been working on upgrading OFED and OMPI for the last few days, so I 
don't have access to nodes running the outdated OFED at the moment; the 
updated ones should be ready to test today.

I remember checking limits.conf and setting it to unlimited, but the warning 
kept showing up.

We use Grid Engine and I set the memory to unlimited. However, I don't think 
the scheduler has anything to do with the problem, since I ran an MPI job 
directly and the same warning appeared.

Adding these parameters yielded an error for the option 'log_mtts_per_seg'. I 
can't recall the error exactly, but it was something like option not 
recognized or not supported. And setting 'log_num_mtt', as mentioned before, 
causes the ib0 interface to fail.

I'll report back what happens on the updated versions.

Waleed Lotfy
Bibliotheca Alexandrina

From: users [users-boun...@open-mpi.org] on behalf of Gus Correa 
[g...@ldeo.columbia.edu]
Sent: Tuesday, December 30, 2014 8:01 PM
To: Open MPI Users
Subject: Re: [OMPI users] Increasing OFED registerable memory

Hi Waleed

Even before any OFED upgrades, you could try the items
in the list below.
I have OMPI 1.6.5 and 1.8.3 working with an older OFED version,
with those settings.
That is not really OMPI's fault, but InfiniBand/OFED's.

1) Make sure your locked memory is set to unlimited in
/etc/security/limits.conf

For instance:

*   soft   memlock   unlimited
*   hard   memlock   unlimited


2) If you are using a queue system, make sure it sets the
locked memory to unlimited, so that all child processes
(including your mpiexec and mpi executable) will get it.

For instance, in Torque /etc/init.d/pbs_mom
or in /etc/sysconfig/pbs_mom:

# locked memory
ulimit -l unlimited

3) Add the parameters below to
/etc/modprobe.d/mlx4_core.conf

options mlx4_core log_num_mtt=22 log_mtts_per_seg=1

Do this with care, as the settings vary according to the physical RAM.
In addition,  the parameters seem to have been deprecated in 3.X
kernels, which makes this tricky.
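
To see how the two module parameters in item 3 relate to physical RAM, the
Open MPI FAQ (the ib-low-reg-mem entry listed below) gives the formula
max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE.
The following is a minimal Python sketch of that arithmetic, assuming a
4 KiB page size; the function name is made up for this illustration.

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registerable memory implied by the mlx4_core parameters, in bytes.

    page_size defaults to 4096, an assumption that holds on typical
    x86_64 Linux nodes but should be checked with "getconf PAGE_SIZE".
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


GIB = 1 << 30
print(max_registerable_bytes(22, 1) / GIB)  # 32.0
```

Under those assumptions the values suggested above allow 32 GiB to be
registered, which is about twice the RAM of a 16 GB node.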

See these FAQs:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

***
Having said that, a question remains unanswered:
Why is InfiniBand such a nightmare?
***

I hope this helps,
Gus Correa

Re: [OMPI users] Increasing OFED registerable memory

2014-12-30 Thread Waleed Lotfy
Thanks, Devendar, for your response.

I'll test it on a new installation with OFED 2.3.2 and OMPI v1.6.5. If it 
doesn't work I'll give 1.8.4 a try.

Thank you for your help and I'll get back to you with hopefully good results.

Waleed Lotfy
Bibliotheca Alexandrina

From: users [users-boun...@open-mpi.org] on behalf of Deva 
[devendar.bure...@gmail.com]
Sent: Monday, December 29, 2014 8:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Increasing OFED registerable memory

Hi Waleed,

It is highly recommended to upgrade to the latest OFED.  Meanwhile, can you 
try the latest OMPI release (v1.8.4), where this warning is ignored on older 
OFEDs?

-Devendar



[OMPI users] Increasing OFED registerable memory

2014-12-28 Thread Waleed Lotfy
I have a bunch of nodes in a cluster that were recently upgraded from
8 GB to 16 GB of memory. When I run any job I get the following warning:
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:          comp022.local
  Registerable memory: 8192 MiB
  Total memory:        16036 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

Searching for a fix to this issue, I found that I have to set
log_num_mtt within the kernel module, so I added this line to
modprobe.conf:

options mlx4_core log_num_mtt=21

But then the ib0 interface fails to start, showing this error:
ib_ipoib device ib0 does not seem to be present, delaying
initialization.

Reducing the value of log_num_mtt to 20 allows ib0 to start, but the
warning about 8 GB of registerable memory still appears.
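
The numbers in the warning can be checked against the formula in the Open
MPI FAQ, max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE.
The Python sketch below searches for the smallest log_num_mtt that covers a
given amount of RAM. It assumes log_mtts_per_seg=1 and a 4 KiB page size;
neither is stated in this post, although together they reproduce the
8192 MiB figure for log_num_mtt=20. The function name and the factor
parameter are made up for this illustration.

```python
def min_log_num_mtt(physical_mib, log_mtts_per_seg=1, page_size=4096, factor=2):
    """Smallest log_num_mtt whose registerable memory is at least
    factor * physical_mib MiB.

    factor=2 follows the FAQ's advice to allow registering twice the
    physical memory; log_mtts_per_seg and page_size are assumed values.
    """
    target_bytes = factor * physical_mib * (1 << 20)
    log_num_mtt = 0
    while (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size < target_bytes:
        log_num_mtt += 1
    return log_num_mtt


print(min_log_num_mtt(16036))            # 22
print(min_log_num_mtt(16036, factor=1))  # 21
```

Under these assumptions, 21 would just cover the 16036 MiB reported as
total memory, and 22 would satisfy the twice-physical-memory
recommendation. Whether the installed driver accepts those values is a
separate question, as the ib0 failure above shows.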

I am using OFED 1.3.1. I know it is pretty old, and we are planning to
upgrade soon.

Output on all nodes for 'ompi_info  -v ompi full --parsable':

ompi:version:full:1.2.7
ompi:version:svn:r19401
orte:version:full:1.2.7
orte:version:svn:r19401
opal:version:full:1.2.7
opal:version:svn:r19401

Any help would be appreciated.

Waleed Lotfy
Bibliotheca Alexandrina