I have a problem running a well-tested MPI-based application. The program has been used for years without trouble. Suddenly, an executable that had run many times without problems crashed with SIGSEGV. The very same executable works fine when run with root privileges. The same happens with other executables and across several recompilation attempts.
We could not find any relevant change in the O.S. since a few days ago, when the program still worked under an unprivileged user ID. At about the same time we did change the GID of the user experiencing the fault, but we think this is not relevant, because the same SIGSEGV happens to another user whose account was not modified. Moreover, we cannot see how that change could affect the running executable (we checked all file permissions in the directory tree where the program is used).

Running the program under GDB we get the trace reported below. The segfault happens at the very beginning, during MPI initialization.

We can use the program with sudo, but I'd like to find out what happened so that we can go back to "normal" usage.

I'd appreciate any hint on the issue.

Many thanks,

Luca Fini

==============================
Here follow a few environment details:

Program started with:

  mpirun -debug -debugger gdb -np 1 /home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD

OPEN-MPI 1.6.5
Linux 2.6.32-431.29.2.2.6.32-431.29.2.el6.x86_64
Intel fortran Compiler: 2011.7.256

=========================
Here follows the stack trace:

Starting program: /home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
    type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
    requested_component_names=0x0, include_mode=128, found_components=0x1,
    open_dso_components=16) at mca_base_component_find.c:162
162         OBJ_CONSTRUCT(found_components, opal_list_t);
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
libgfortran-4.4.7-11.el6.x86_64 libtool-ltdl-2.2.6-15.5.el6.x86_64
openmpi-1.8.1-1.el6.x86_64
(gdb) where
#0  0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
    type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
    requested_component_names=0x0, include_mode=128, found_components=0x1,
    open_dso_components=16) at mca_base_component_find.c:162
#1  0x0000003b90c4870a in mca_base_framework_components_register ()
    from /usr/lib64/openmpi/lib/libopen-pal.so.6
#2  0x0000003b90c48c06 in mca_base_framework_register ()
    from /usr/lib64/openmpi/lib/libopen-pal.so.6
#3  0x0000003b90c48def in mca_base_framework_open ()
    from /usr/lib64/openmpi/lib/libopen-pal.so.6
#4  0x0000003b914407e7 in ompi_mpi_init ()
    from /usr/lib64/openmpi/lib/libmpi.so.1
#5  0x0000003b91463200 in PMPI_Init ()
    from /usr/lib64/openmpi/lib/libmpi.so.1
#6  0x00002aaaaacd9295 in mpi_init_f (ierr=0x7fffffffd268) at pinit_f.c:75
#7  0x00000000005bb159 in MODE_MNH_WORLD::init_nmnh_comm_world (
    kinfo_ll=Cannot access memory at address 0x0
) at /home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_mnh_world.f90:45
#8  0x00000000005939d3 in MODE_IO_LL::initio_ll ()
    at /home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_io_ll.f90:107
#9  0x000000000049d02f in prep_pgd ()
    at /home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_prep_pgd.f90:130
#10 0x000000000049cf8c in main ()

-- 
Luca Fini.
INAF - Oss. Astrofisico di Arcetri
L.go E.Fermi, 5. 50125 Firenze. Italy
Tel: +39 055 2752 307
Fax: +39 055 2752 292
Skype: l.fini
Web: http://www.arcetri.inaf.it/~lfini