I've recompiled 3.1.1 with --enable-debug --enable-mem-debug, and I still get no detailed information from the MPI libraries, only from VASP (as before).
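In case it helps, one thing I may also try, to recover function/file/line info for the frames that show up as Unknown in the Open MPI libraries below, is addr2line against the debug-built .so files. This is only a sketch: the library's load base has to come from the actual crashing run (a core file or /proc/<pid>/maps), since the addresses printed by the separate ldd run below won't match, and BASE here is just a placeholder, not a real value:

    # sketch only: resolve one "Unknown" frame from the traceback below
    LIBMPI=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
    FRAME=0x00002AD1684EE969   # PMPI_Allreduce frame address from the traceback
    BASE=0x00002AD168400000    # placeholder: real load base must come from the crashed process
    addr2line -f -C -e "$LIBMPI" $(printf '0x%x' $(( FRAME - BASE )))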
ldd (at runtime, so I'm fairly sure it's referring to the right executable and LD_LIBRARY_PATH) info:

vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamma_para.intel
        linux-vdso.so.1 =>  (0x00007ffd869f6000)
        libmkl_intel_lp64.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b0b70015000)
        libmkl_sequential.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_sequential.so (0x00002b0b70a56000)
        libmkl_core.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_core.so (0x00002b0b717ef000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000366a000000)
        libmpi_usempif08.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempif08.so.40 (0x00002b0b732f3000)
        libmpi_usempi_ignore_tkr.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempi_ignore_tkr.so.40 (0x00002b0b73535000)
        libmpi_mpifh.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40 (0x00002b0b73737000)
        libmpi.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40 (0x00002b0b73991000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f5b400000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f5ac00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f5a800000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f5a400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003669800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f5a000000)
        libopen-rte.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-rte.so.40 (0x00002b0b73d48000)
        libopen-pal.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40 (0x00002b0b74066000)
        libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003f5bc00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003f5b000000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003f6c000000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003f5b800000)
        libifport.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifport.so.5 (0x00002b0b743b8000)
        libifcore.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcore.so.5 (0x00002b0b745e7000)
        libimf.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libimf.so (0x00002b0b74948000)
        libsvml.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libsvml.so (0x00002b0b74e35000)
        libintlc.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libintlc.so.5 (0x00002b0b75d40000)
        libifcoremt.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00002b0b75faa000)

ompi_info (using the same path as indicated by the ldd output):

tin 1125 : /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/bin/ompi_info | grep debug
                  Prefix: /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080
  Configure command line: '--prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080' '--with-tm=/usr/local/torque' '--enable-mpirun-prefix-by-default' '--with-verbs=/usr' '--with-verbs-libdir=/usr/lib64' '--enable-debug' '--enable-mem-debug'
  Internal debug support: yes
Memory debugging support: yes

resulting stack trace:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
vasp.gamma_para.i  0000000002DCE8C1  Unknown            Unknown  Unknown
vasp.gamma_para.i  0000000002DCC9FB  Unknown            Unknown  Unknown
vasp.gamma_para.i  0000000002D409E4  Unknown            Unknown  Unknown
vasp.gamma_para.i  0000000002D407F6  Unknown            Unknown  Unknown
vasp.gamma_para.i  0000000002CDCED9  Unknown            Unknown  Unknown
vasp.gamma_para.i  0000000002CE3DB6  Unknown            Unknown  Unknown
libpthread-2.12.s  0000003F5AC0F7E0  Unknown            Unknown  Unknown
mca_btl_vader.so   00002AD17AC74CB8  Unknown            Unknown  Unknown
mca_btl_vader.so   00002AD17AC770F5  Unknown            Unknown  Unknown
libopen-pal.so.40  00002AD168B816A4  opal_progress      Unknown  Unknown
libmpi.so.40.10.1  00002AD1684D0D75  Unknown            Unknown  Unknown
libmpi.so.40.10.1  00002AD1684D0DB8  ompi_request_defa  Unknown  Unknown
libmpi.so.40.10.1  00002AD168571EBE  ompi_coll_base_se  Unknown  Unknown
libmpi.so.40.10.1  00002AD1685724B8  Unknown            Unknown  Unknown
libmpi.so.40.10.1  00002AD168573514  ompi_coll_base_al  Unknown  Unknown
mca_coll_tuned.so  00002AD17CD6C852  ompi_coll_tuned_a  Unknown  Unknown
libmpi.so.40.10.1  00002AD1684EE969  PMPI_Allreduce     Unknown  Unknown
libmpi_mpifh.so.4  00002AD1682595B7  mpi_allreduce_     Unknown  Unknown
vasp.gamma_para.i  000000000042D1ED  m_sum_d_           1300     mpi.F
vasp.gamma_para.i  0000000001BD5293  david_mp_eddav_.R  778      davidson.F
vasp.gamma_para.i  0000000001D2179E  elmin_.R           424      electron.F
vasp.gamma_para.i  0000000002B92452  vamp_IP_electroni  4783     main.F
vasp.gamma_para.i  0000000002B6E173  MAIN__             2800     main.F
vasp.gamma_para.i  000000000041325E  Unknown            Unknown  Unknown
libc-2.12.so       0000003F5A41ED1D  __libc_start_main  Unknown  Unknown
vasp.gamma_para.i  0000000000413169  Unknown            Unknown  Unknown

I've checked ulimit -s (at runtime), and it is unlimited.

I'm going to try the 3.1.x 20180710 nightly snapshot next, and I'll ask the source of the VASP inputs whether they can be shared. Note that the crash really only happens at an appreciable rate when running on 128 tasks (8 x 16-core nodes), and even then, for a 10-geometry-step run, in only about 1/3 of runs, so it's not a completely trivial amount of resources to reproduce.

Noam
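P.S. Since the topmost non-VASP frames are in mca_btl_vader.so, another experiment I may try, purely to narrow things down rather than as a fix, is a run with the vader BTL excluded, along the lines of

    mpirun --mca btl ^vader -np 128 /usr/local/vasp/bin/5.4.4/0test/vasp.gamma_para.intel

(the command line is only a sketch). If the crash goes away with vader excluded, that would at least implicate the shared-memory path.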