Hi Luca

Another possibility that comes to mind,
besides the mixed versions mentioned by Gilles,
is the OS resource limits.
Limits can differ from one user to another and with user privileges,
which would fit a crash that goes away when running as root.

Large programs tend to require a large stack size (even unlimited),
and typically segfault when the stack is not large enough.
The maximum number of open files is yet another hurdle.
And if you're using InfiniBand, the max locked memory size should be unlimited.
Check /etc/security/limits.conf and "ulimit -a".
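
If it helps to compare accounts side by side, here is a minimal sketch (my own illustration, assuming Linux and a Python 3 interpreter on the node) that prints the three limits mentioned above for whichever user runs it. The labels and the helper names fmt and current_limits are made up for this example. Run it once as the unprivileged user and once as root, then compare the output.

```python
import resource

# The three limits discussed above. The labels are illustrative only.
LIMITS = {
    "stack size": resource.RLIMIT_STACK,
    "open files": resource.RLIMIT_NOFILE,
    "locked memory": resource.RLIMIT_MEMLOCK,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the calling process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label:15s} soft={fmt(soft):>12s} hard={fmt(hard):>12s}")
```

Note that getrlimit reports the stack and locked memory limits in bytes, whereas "ulimit -a" in bash shows them in kbytes, so the numbers will not match digit for digit.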

I hope this helps,
Gus Correa

On 12/10/2014 08:28 AM, Gilles Gouaillardet wrote:
Luca,

your email mentions openmpi 1.6.5
but gdb output points to openmpi 1.8.1.

could the root cause be a mix of versions that does not occur with the
root account ?

which openmpi version are you expecting ?

you can run
pmap <pid>
when your binary is running and/or under gdb to confirm the openmpi
library that is really used
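
If pmap is not at hand, the same information can be read from /proc/<pid>/maps on Linux. The following is only a sketch under that assumption: the function name mapped_libraries and the list of library name fragments are hypothetical choices for this example, picked from the library names that appear in the gdb trace.

```python
import sys


def mapped_libraries(maps_text, patterns=("libmpi", "libopen-pal", "libopen-rte")):
    """Return the sorted, de-duplicated mapped file paths matching a pattern."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if any(p in path for p in patterns):
                paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    # Use the PID given on the command line, or inspect this process itself.
    has_pid = len(sys.argv) > 1 and sys.argv[1].isdigit()
    pid = sys.argv[1] if has_pid else "self"
    with open(f"/proc/{pid}/maps") as f:
        for path in mapped_libraries(f.read()):
            print(path)
```

Pass the PID of the running binary as the only argument. If the printed paths point to a different installation than the one the program was built against, that would support the mixed-versions explanation.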

Cheers,

Gilles

On Wed, Dec 10, 2014 at 7:21 PM, Luca Fini <lf...@arcetri.astro.it> wrote:

    I have a problem running a well-tested MPI-based application.

    The program has been used for years with no problems. Suddenly the
    executable, which had run many times without problems, started
    crashing with SIGSEGV. The very same executable works fine when run
    with root privileges. The same happens with other executables and
    across several recompilation attempts.

    We could not find any relevant difference in the O.S. compared with a
    few days ago, when the program still worked under an unprivileged
    user ID. In about the same period we changed the GID of the user
    experiencing the fault, but we think this is not relevant because the
    same SIGSEGV happens to another user who was not modified. Moreover,
    we cannot see how that change could affect the running executable (we
    checked all file permissions in the directory tree where the program
    is used).

    Running the program under GDB we get the trace reported below. The
    segfault happens at the very beginning during MPI initialization.

    We can use the program with sudo, but I'd like to find out what
    happened to go back to "normal" usage.

    I'd appreciate any hint on the issue.

    Many thanks,

                                Luca Fini

    ==============================
    Here follow a few environment details:

    Program started with:
    mpirun -debug -debugger gdb -np 1 /home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD

    OPEN-MPI 1.6.5

    Linux 2.6.32-431.29.2.2.6.32-431.29.2.el6.x86_64

    Intel fortran Compiler: 2011.7.256

    =========================
    Here follows the stack trace:

    Starting program:
    
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
    
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
    [Thread debugging using libthread_db enabled]

    Program received signal SIGSEGV, Segmentation fault.
    0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
    type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
    requested_component_names=0x0, include_mode=128, found_components=0x1,
    open_dso_components=16)
         at mca_base_component_find.c:162
    162        OBJ_CONSTRUCT(found_components, opal_list_t);
    Missing separate debuginfos, use: debuginfo-install
    glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
    libgfortran-4.4.7-11.el6.x86_64 libtool-ltdl-2.2.6-15.5.el6.x86_64
    openmpi-1.8.1-1.el6.x86_64
    (gdb) where
    #0  0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
    type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
    requested_component_names=0x0, include_mode=128, found_components=0x1,
    open_dso_components=16)
         at mca_base_component_find.c:162
    #1  0x0000003b90c4870a in mca_base_framework_components_register ()
    from /usr/lib64/openmpi/lib/libopen-pal.so.6
    #2  0x0000003b90c48c06 in mca_base_framework_register () from
    /usr/lib64/openmpi/lib/libopen-pal.so.6
    #3  0x0000003b90c48def in mca_base_framework_open () from
    /usr/lib64/openmpi/lib/libopen-pal.so.6
    #4  0x0000003b914407e7 in ompi_mpi_init () from
    /usr/lib64/openmpi/lib/libmpi.so.1
    #5  0x0000003b91463200 in PMPI_Init () from
    /usr/lib64/openmpi/lib/libmpi.so.1
    #6  0x00002aaaaacd9295 in mpi_init_f (ierr=0x7fffffffd268) at
    pinit_f.c:75
    #7  0x00000000005bb159 in MODE_MNH_WORLD::init_nmnh_comm_world
    (kinfo_ll=Cannot access memory at address 0x0
    ) at
    
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_mnh_world.f90:45
    #8  0x00000000005939d3 in MODE_IO_LL::initio_ll () at
    
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_io_ll.f90:107
    #9  0x000000000049d02f in prep_pgd () at
    
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_prep_pgd.f90:130
    #10 0x000000000049cf8c in main ()

    --
    Luca Fini.  INAF - Oss. Astrofisico di Arcetri
    L.go E.Fermi, 5. 50125 Firenze. Italy
    Tel: +39 055 2752 307     Fax: +39 055 2752 292
    Skype: l.fini
    Web: http://www.arcetri.inaf.it/~lfini
    _______________________________________________
    users mailing list
    us...@open-mpi.org
    Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
    Link to this post:
    http://www.open-mpi.org/community/lists/users/2014/12/25945.php





