I’ve recompiled 3.1.1 with --enable-debug --enable-mem-debug, and I still get no 
detailed information from the MPI libraries, only from VASP (as before):

ldd (at runtime, so I’m fairly sure it’s referring to the right executable and 
LD_LIBRARY_PATH) info:
vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamma_para.intel
        linux-vdso.so.1 =>  (0x00007ffd869f6000)
        libmkl_intel_lp64.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b0b70015000)
        libmkl_sequential.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_sequential.so (0x00002b0b70a56000)
        libmkl_core.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_core.so (0x00002b0b717ef000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000366a000000)
        libmpi_usempif08.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempif08.so.40 (0x00002b0b732f3000)
        libmpi_usempi_ignore_tkr.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempi_ignore_tkr.so.40 (0x00002b0b73535000)
        libmpi_mpifh.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40 (0x00002b0b73737000)
        libmpi.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40 (0x00002b0b73991000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f5b400000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f5ac00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f5a800000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f5a400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003669800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f5a000000)
        libopen-rte.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-rte.so.40 (0x00002b0b73d48000)
        libopen-pal.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40 (0x00002b0b74066000)
        libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003f5bc00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003f5b000000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003f6c000000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003f5b800000)
        libifport.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifport.so.5 (0x00002b0b743b8000)
        libifcore.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcore.so.5 (0x00002b0b745e7000)
        libimf.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libimf.so (0x00002b0b74948000)
        libsvml.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libsvml.so (0x00002b0b74e35000)
        libintlc.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libintlc.so.5 (0x00002b0b75d40000)
        libifcoremt.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00002b0b75faa000)
ompi_info (using the same path indicated by the ldd output):
tin 1125 : /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/bin/ompi_info | grep debug
                  Prefix: /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080
  Configure command line: '--prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080' '--with-tm=/usr/local/torque' '--enable-mpirun-prefix-by-default' '--with-verbs=/usr' '--with-verbs-libdir=/usr/lib64' '--enable-debug' '--enable-mem-debug'
  Internal debug support: yes
Memory debugging support: yes
Resulting stack trace:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
vasp.gamma_para.i  0000000002DCE8C1  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002DCC9FB  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002D409E4  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002D407F6  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002CDCED9  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002CE3DB6  Unknown               Unknown  Unknown
libpthread-2.12.s  0000003F5AC0F7E0  Unknown               Unknown  Unknown
mca_btl_vader.so   00002AD17AC74CB8  Unknown               Unknown  Unknown
mca_btl_vader.so   00002AD17AC770F5  Unknown               Unknown  Unknown
libopen-pal.so.40  00002AD168B816A4  opal_progress         Unknown  Unknown
libmpi.so.40.10.1  00002AD1684D0D75  Unknown               Unknown  Unknown
libmpi.so.40.10.1  00002AD1684D0DB8  ompi_request_defa     Unknown  Unknown
libmpi.so.40.10.1  00002AD168571EBE  ompi_coll_base_se     Unknown  Unknown
libmpi.so.40.10.1  00002AD1685724B8  Unknown               Unknown  Unknown
libmpi.so.40.10.1  00002AD168573514  ompi_coll_base_al     Unknown  Unknown
mca_coll_tuned.so  00002AD17CD6C852  ompi_coll_tuned_a     Unknown  Unknown
libmpi.so.40.10.1  00002AD1684EE969  PMPI_Allreduce        Unknown  Unknown
libmpi_mpifh.so.4  00002AD1682595B7  mpi_allreduce_        Unknown  Unknown
vasp.gamma_para.i  000000000042D1ED  m_sum_d_                 1300  mpi.F
vasp.gamma_para.i  0000000001BD5293  david_mp_eddav_.R         778  davidson.F
vasp.gamma_para.i  0000000001D2179E  elmin_.R                  424  electron.F
vasp.gamma_para.i  0000000002B92452  vamp_IP_electroni        4783  main.F
vasp.gamma_para.i  0000000002B6E173  MAIN__                   2800  main.F
vasp.gamma_para.i  000000000041325E  Unknown               Unknown  Unknown
libc-2.12.so       0000003F5A41ED1D  __libc_start_main     Unknown  Unknown
vasp.gamma_para.i  0000000000413169  Unknown               Unknown  Unknown
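
Since the trace dies inside mca_btl_vader during the allreduce, one thing I could also try (just a sketch, I haven’t run this yet) is excluding the vader shared-memory BTL to see whether the crash follows it:

```shell
# Untested sketch: run the same job with the vader BTL excluded, so
# intra-node traffic falls back to other transports (e.g. tcp/self,
# plus openib since this build has verbs support).  If the SIGSEGV
# goes away, that points at vader's shared-memory path.
mpirun --mca btl ^vader -np 128 \
    /usr/local/vasp/bin/5.4.4/0test/vasp.gamma_para.intel
```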


I’ve checked ulimit -s (at runtime), and it is unlimited.
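
To be thorough about "at runtime", something like this (hypothetical one-liner, since the batch system could reset limits on the compute nodes) would confirm the stack limit as each rank actually sees it:

```shell
# Untested sketch: report the stack limit from inside the launched
# environment on every node; sort -u collapses identical answers.
mpirun -np 128 sh -c 'echo "$(hostname): $(ulimit -s)"' | sort -u
```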

I’m going to try the 3.1.x 20180710 nightly snapshot next.

Let me ask the source of the VASP inputs about sharing them.  Note that the 
crash really only happens at an appreciable rate when running on 128 tasks 
(8 x 16-core nodes), and even then, for a 10-geometry-step run, only in about 
1/3 of all runs, so it’s not a completely trivial amount of resources to 
reproduce.

                                                                                
        Noam

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users