Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-26 Thread Mouhamad Al-Sayed-Ali

Hi Gus,

  I have done as you suggested, but it still doesn't work!

Many thanks for your help


Mouhamad

Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-26 Thread Gus Correa

Hi Mouhamad

A stack of 10240 kB is probably the Linux default,
which is not necessarily good for HPC and number crunching.
I'd suggest that you change it to unlimited,
unless your system administrator has a very good reason not to do so.
We've seen many atmosphere/ocean/climate models crash because
they couldn't allocate memory on the stack [automatic arrays
in subroutines, etc.].

This has nothing to do with MPI,
the programs can fail even when they run in serial mode
because of this.

You can just append this line to /etc/security/limits.conf:

*   -   stack   -1


I hope this helps,
Gus Correa
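
[A minimal illustration, not from the thread, of why a 10 MB stack
bites this kind of code: a large local array -- the C analogue of a
big Fortran automatic array -- overflows the stack and the process
dies with SIGSEGV unless the limit is raised. The file name and sizes
below are hypothetical.]

/* stack_demo.c -- compile without optimization: gcc -O0 stack_demo.c
 * Under "ulimit -s 10240" this typically segfaults before printing;
 * under "ulimit -s unlimited" it completes. */
#include <stdio.h>

#define N (4 * 1024 * 1024)     /* 4M doubles = 32 MB, far above 10 MB */

static double sum_local(void)
{
    volatile double a[N];       /* lives on the stack, like a Fortran
                                   automatic array in a subroutine */
    double s = 0.0;
    for (long i = 0; i < N; i++) {
        a[i] = (double)i;
        s += a[i];
    }
    return s;
}

int main(void)
{
    printf("sum = %f\n", sum_local());
    return 0;
}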




Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-26 Thread Mouhamad Al-Sayed-Ali

Hi Gus Correa,

 the output of ulimit -a is



file(blocks)          unlimited
coredump(blocks)      2048
data(kbytes)          unlimited
stack(kbytes)         10240
lockedmem(kbytes)     unlimited
memory(kbytes)        unlimited
nofiles(descriptors)  1024
processes             256



Thanks

Mouhamad

Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Gus Correa

Hi Mouhamad

The locked memory is set to unlimited, but the lines
about the stack are commented out.
Have you tried to add this line:

*   -   stack   -1

then run wrf again? [Note no "#" hash character]

Also, if you login to the compute nodes,
what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [sh,bash]?
This should tell you what limits are actually set.

I hope this helps,
Gus Correa
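
[Batch-launched jobs do not always inherit the limits of an
interactive login, so it can also help to check from inside an MPI
job itself. Below is a small probe -- a sketch assuming Open MPI's
mpicc wrapper; the file name is hypothetical and the code is not
part of the thread.]

/* limits_check.c -- print the stack limit each rank actually gets.
 * Build: mpicc limits_check.c -o limits_check
 * Run:   mpirun -machinefile machines -np 4 ./limits_check */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    char host[256];
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));
    getrlimit(RLIMIT_STACK, &rl);

    if (rl.rlim_cur == RLIM_INFINITY)
        printf("rank %d on %s: stack soft limit = unlimited\n", rank, host);
    else
        printf("rank %d on %s: stack soft limit = %lu kB\n",
               rank, host, (unsigned long)(rl.rlim_cur / 1024));

    MPI_Finalize();
    return 0;
}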



Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hi all,

   I've checked the "limits.conf", and it contains these lines:


# Jcb 29.06.2007 : pbs wrf (Siji)
#*  hardstack   100
#*  softstack   100

# Dr 14.02.2008 : for Voltaire MPI
*  hardmemlock unlimited
*  softmemlock unlimited



Many thanks for your help
Mouhamad



Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Gus Correa

Hi Mouhamad, Ralph, Terry

Very often big programs like wrf crash with a segfault because they
can't allocate memory on the stack; they assume the system doesn't
impose any limit on it.  This has nothing to do with MPI.

Mouhamad:  Check if your stack size is set to unlimited on all compute
nodes.  The easy way to get it done
is to change /etc/security/limits.conf,
where you or your system administrator could add these lines:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  4096

My two cents,
Gus Correa




Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Ralph Castain
Looks like you are crashing in wrf - have you asked them for help?





Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread TERRY DONTJE

This looks more like a seg fault in wrf and not OMPI.

Sorry, not much I can do here to help you.

--td



--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hi again,

 This is exactly the error I have:


taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***
---

Mouhamad


Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hello


can you run wrf successfully on one node?


No, it can't run on one node.

Can you run a simple code across your two nodes?  I would try  
hostname then some simple MPI program like the ring example.

Yes, I can run a simple code

many thanks

Mouhamad






Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread TERRY DONTJE

Can you run wrf successfully on one node?
Can you run a simple code across your two nodes?  I would try hostname 
then some simple MPI program like the ring example.


--td
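
[For reference, a minimal ring program of the kind suggested above --
a sketch assuming Open MPI's C bindings and the mpicc wrapper; the
file name is hypothetical and the code is not part of the thread.]

/* ring.c -- pass a token once around all ranks; a quick sanity test
 * that point-to-point MPI traffic works across the nodes.
 * Build: mpicc ring.c -o ring
 * Run:   mpirun -machinefile machines -np 4 ./ring   (use >= 2 ranks) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;             /* arbitrary payload */
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got the token back: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}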






Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hello,


-What version of ompi are you using

  I am using ompi version 1.4.1-1 compiled with gcc 4.5


-What type of machine and os are you running on

   I'm using a 64-bit Linux machine.


-What does the machine file look like

  part033
  part033
  part031
  part031


-Is there a stack trace left behind by the pid that seg faulted?

  No, there is no stack trace


Thanks for your help

Mouhamad Alsayed


Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread TERRY DONTJE

Some more info would be nice like:
-What version of ompi are you using
-What type of machine and os are you running on
-What does the machine file look like
-Is there a stack trace left behind by the pid that seg faulted?

--td

On 10/25/2011 8:07 AM, Mouhamad Al-Sayed-Ali wrote:

Hello,

I have tried to run the executable "wrf.exe", using

  mpirun -machinefile /tmp/108388.1.par2/machines -np 4 wrf.exe

but, I've got the following error:

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 9942 on node
part031.u-bourgogne.fr exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


   11.54s real 6.03s user 0.32s system
Starter(9908): Return code=139
Starter end(9908)




Thanks for your help


Mouhamad Alsayed



