Hi Gus Correa,

The output of 'ulimit -a' is


----
file(blocks)         unlimited
coredump(blocks)     2048
data(kbytes)         unlimited
stack(kbytes)        10240
lockedmem(kbytes)    unlimited
memory(kbytes)       unlimited
nofiles(descriptors) 1024
processes            256
--------
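
The stack line above is the one to watch: stack(kbytes) is 10240, i.e.
only 10 MB, which is small for wrf and matches the stack-limit suspicion
in the messages below. A minimal workaround, assuming a bash job script
(the process count is illustrative), is to raise the limit in the script
itself, just before launching wrf:

----
# sketch for a bash job script; '-np 16' is illustrative
ulimit -s unlimited        # lift the stack cap for this shell and its children
mpirun -np 16 ./wrf.exe    # wrf then inherits the unlimited stack
----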


Thanks

Mouhamad
Gus Correa <g...@ldeo.columbia.edu> wrote:

Hi Mouhamad

The locked memory is set to unlimited, but the lines
about the stack are commented out.
Have you tried to add this line:

*   -   stack       -1

then run wrf again? [Note no "#" hash character]

Also, if you log in to the compute nodes,
what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [sh,bash]?
This should tell you what limits are actually set.
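
If an interactive login is not possible, the same check can be run
through Open MPI itself, which also shows the limits the MPI processes
actually inherit. A sketch, with illustrative host names:

----
# one process per node; host names are illustrative
$ mpirun -np 2 --host part034,part035 sh -c 'hostname; ulimit -s'
----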

I hope this helps,
Gus Correa

Mouhamad Al-Sayed-Ali wrote:
Hi all,

 I've checked "limits.conf", and it contains these lines:


# Jcb 29.06.2007 : pbs wrf (Siji)
#*      hard    stack   1000000
#*      soft    stack   1000000

# Dr 14.02.2008 : for voltaire mpi
*      hard    memlock unlimited
*      soft    memlock unlimited
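
Since the stack lines are commented out, no stack setting is applied at
all, and the distribution default (often 10240 kbytes) stays in force.
A sketch of the suggested edit, written in the same hard/soft syntax the
file already uses:

----
# hypothetical edit: uncomment the stack lines and raise them to unlimited
*      hard    stack   unlimited
*      soft    stack   unlimited
----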



Many thanks for your help
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:

Hi Mouhamad, Ralph, Terry

Very often big programs like wrf crash with a segfault because they
cannot allocate memory on the stack; they assume the system imposes
no limit on it.  This has nothing to do with MPI.
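
A quick way to see this mechanism in isolation, as a sketch (the
1024-kbyte value is only illustrative):

----
# shrink the stack in a subshell, then retry without the cap
$ ( ulimit -s 1024; ./wrf.exe )       # 1 MB stack: large automatic arrays segfault
$ ( ulimit -s unlimited; ./wrf.exe )  # no cap: the same allocations succeed
----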

Mouhamad:  Check whether your stack size is set to unlimited on all
compute nodes.  The easy way to get that done
is to edit /etc/security/limits.conf,
where you or your system administrator could add these lines:

*   -   memlock     -1
*   -   stack       -1
*   -   nofile      4096
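
Here '-1' means unlimited.  These limits are applied by pam_limits at
login time, so they take effect only for new sessions; daemons started
earlier (e.g. the batch system) keep the old limits until restarted,
which is why setting 'ulimit -s unlimited' inside the job script is a
robust fallback.  A sketch of a verification step, assuming the node's
default shell is sh/bash (node name taken from the traceback below):

----
$ ssh part034 'ulimit -s'    # expect: unlimited after the change
----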

My two cents,
Gus Correa

Ralph Castain wrote:
Looks like you are crashing inside wrf itself - have you asked the wrf developers for help?

On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:

Hi again,

This is exactly the error I have:

----
taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffffffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***
-------

Mouhamad
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users