I have a problem running a well-tested MPI-based application.

The program has been used for years without problems. Suddenly an
executable that had been run many times without problems crashed
with SIGSEGV. The very same executable works fine when run with root
privileges. The same happens with other executables, and it persists
across several recompilation attempts.

We could not find any relevant change in the O.S. since a few days
ago, when the program still worked under an unprivileged user ID.
At about the same time we did change the GID of the user
experiencing the fault, but we think this is not relevant because the
same SIGSEGV happens to another user whose account was not modified.
Moreover, we cannot see how that change could affect the running
executable (we checked all file permissions in the directory tree
where the program is used).
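
One further check that might be worth making is whether the dynamic
linker resolves the same shared libraries for the unprivileged user
and for root. What follows is only a minimal sketch, assuming that
ldd, awk and sort are available; the function name resolved_libs is
purely illustrative and not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): print the absolute paths of
# the shared libraries that ldd reports for an executable in the current
# environment, one per line, sorted and without duplicates.
resolved_libs() {
    # Keep only lines of the form "name => /absolute/path (address)".
    ldd "$1" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u
}
```

Running it on the executable once as the unprivileged user and once
from a root shell, and comparing the two outputs (for instance with
diff), would show whether different copies of the MPI libraries are
picked up in the two cases.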

Running the program under GDB we get the trace reported below. The
segfault happens at the very beginning during MPI initialization.

We can use the program with sudo, but I'd like to find out what
happened, so that we can go back to "normal" usage.

I'd appreciate any hint on the issue.

Many thanks,

                           Luca Fini

==============================
Here follow a few environment details:

Program started with: mpirun -debug -debugger gdb  -np 1
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD

OPEN-MPI 1.6.5

Linux 2.6.32-431.29.2.2.6.32-431.29.2.el6.x86_64

Intel fortran Compiler: 2011.7.256

=========================
Here follows the stack trace:

Starting program:
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
requested_component_names=0x0, include_mode=128, found_components=0x1,
open_dso_components=16)
    at mca_base_component_find.c:162
162        OBJ_CONSTRUCT(found_components, opal_list_t);
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
libgfortran-4.4.7-11.el6.x86_64 libtool-ltdl-2.2.6-15.5.el6.x86_64
openmpi-1.8.1-1.el6.x86_64
(gdb) where
#0  0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
requested_component_names=0x0, include_mode=128, found_components=0x1,
open_dso_components=16)
    at mca_base_component_find.c:162
#1  0x0000003b90c4870a in mca_base_framework_components_register ()
from /usr/lib64/openmpi/lib/libopen-pal.so.6
#2  0x0000003b90c48c06 in mca_base_framework_register () from
/usr/lib64/openmpi/lib/libopen-pal.so.6
#3  0x0000003b90c48def in mca_base_framework_open () from
/usr/lib64/openmpi/lib/libopen-pal.so.6
#4  0x0000003b914407e7 in ompi_mpi_init () from
/usr/lib64/openmpi/lib/libmpi.so.1
#5  0x0000003b91463200 in PMPI_Init () from /usr/lib64/openmpi/lib/libmpi.so.1
#6  0x00002aaaaacd9295 in mpi_init_f (ierr=0x7fffffffd268) at pinit_f.c:75
#7  0x00000000005bb159 in MODE_MNH_WORLD::init_nmnh_comm_world
(kinfo_ll=Cannot access memory at address 0x0
) at 
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_mnh_world.f90:45
#8  0x00000000005939d3 in MODE_IO_LL::initio_ll () at
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_io_ll.f90:107
#9  0x000000000049d02f in prep_pgd () at
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_prep_pgd.f90:130
#10 0x000000000049cf8c in main ()

-- 
Luca Fini.  INAF - Oss. Astrofisico di Arcetri
L.go E.Fermi, 5. 50125 Firenze. Italy
Tel: +39 055 2752 307     Fax: +39 055 2752 292
Skype: l.fini
Web: http://www.arcetri.inaf.it/~lfini
