Hi, I'm using MKL ScaLAPACK in my project. Recently I tried to run my application on a new set of nodes. Unfortunately, whenever I try to execute more than about 20 processes, I get a segmentation fault:
[compn7:03552] *** Process received signal ***
[compn7:03552] Signal: Segmentation fault (11)
[compn7:03552] Signal code: Address not mapped (1)
[compn7:03552] Failing at address: 0x20b2e68
[compn7:03552] [ 0] /lib64/libpthread.so.0(+0xf3c0) [0x7f46e0fc33c0]
[compn7:03552] [ 1] /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0xd577) [0x7f46dd093577]
[compn7:03552] [ 2] /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x5b4c) [0x7f46dc5edb4c]
[compn7:03552] [ 3] /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(+0x1dbe8) [0x7f46e0679be8]
[compn7:03552] [ 4] /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(opal_progress+0xa1) [0x7f46e066dbf1]
[compn7:03552] [ 5] /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5945) [0x7f46dd08b945]
[compn7:03552] [ 6] /home/gmaj/lib/openmpi/lib/libmpi.so.0(MPI_Send+0x6a) [0x7f46e0b4f10a]
[compn7:03552] [ 7] /home/gmaj/matrix/matrix(BI_Ssend+0x21) [0x49cc11]
[compn7:03552] [ 8] /home/gmaj/matrix/matrix(BI_IdringBR+0x79) [0x49c579]
[compn7:03552] [ 9] /home/gmaj/matrix/matrix(ilp64_Cdgebr2d+0x221) [0x495bb1]
[compn7:03552] [10] /home/gmaj/matrix/matrix(Cdgebr2d+0xd0) [0x47ffb0]
[compn7:03552] [11] /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so(PB_CInV2+0x1304) [0x7f46e27f5124]
[compn7:03552] *** End of error message ***

This error appears during some ScaLAPACK computation; my processes do some MPI communication before the crash. I found that by lowering the btl_tcp_eager_limit and btl_tcp_max_send_size parameters I can run more processes: the smaller those values, the more processes I can start. Unfortunately, this way I've only managed to get up to 30 processes, which is still far too small.
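For reference, this is how I've been passing those MCA parameters on the command line (the byte values shown are just examples of what I experimented with, not my exact settings, and ./matrix stands for my binary):

```shell
# Shrink the TCP BTL eager-send and max-fragment sizes (values in bytes).
# Smaller values let more processes start before the crash appears.
mpirun -np 30 \
    --mca btl_tcp_eager_limit 4096 \
    --mca btl_tcp_max_send_size 16384 \
    ./matrix
```

The current values of these parameters can be inspected with `ompi_info --param btl tcp`.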
Another clue may be what valgrind reports:

==3894== Syscall param writev(vector[...]) points to uninitialised byte(s)
==3894==    at 0x82D009B: writev (in /lib64/libc-2.12.90.so)
==3894==    by 0xBA2136D: mca_btl_tcp_frag_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
==3894==    by 0xBA203D0: mca_btl_tcp_endpoint_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
==3894==    by 0xB003583: mca_pml_ob1_send_request_start_rdma (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3894==    by 0xAFFA7C9: mca_pml_ob1_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3894==    by 0x6D4B109: PMPI_Send (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x49CC10: BI_Ssend (in /home/gmaj/matrix/matrix)
==3894==    by 0x49C578: BI_IdringBR (in /home/gmaj/matrix/matrix)
==3894==    by 0x495BB0: ilp64_Cdgebr2d (in /home/gmaj/matrix/matrix)
==3894==    by 0x47FFAF: Cdgebr2d (in /home/gmaj/matrix/matrix)
==3894==    by 0x51B38E0: PB_CInV2 (in /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
==3894==    by 0x51DB89B: PB_CpgemmAB (in /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
==3894==  Address 0xadecdce is 461,886 bytes inside a block of size 527,544 alloc'd
==3894==    at 0x4C2615D: malloc (vg_replace_malloc.c:195)
==3894==    by 0x6D0BBA3: ompi_free_list_grow (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0xBA1E1A4: mca_btl_tcp_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
==3894==    by 0x6D5C909: mca_btl_base_select (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0xB40E950: mca_bml_r2_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_bml_r2.so)
==3894==    by 0x6D5C07E: mca_bml_base_init (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0xAFF8A0E: mca_pml_ob1_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3894==    by 0x6D663B2: mca_pml_base_select (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x6D25D20: ompi_mpi_init (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x6D45987: PMPI_Init_thread (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x42490A: MPI::Init_thread(int&, char**&, int) (functions_inln.h:150)
==3894==    by 0x41F483: main (matrix.cpp:83)

I've tried configuring Open MPI with the --without-memory-manager option, but it didn't help. I can successfully run exactly the same application on other machines, even on clusters with more than 800 nodes.

Does anyone have any idea how to debug this issue further? Any help would be appreciated.

Thanks,
Grzegorz Maj
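In case it matters, this is roughly how I ran the application under valgrind to get the report above (a sketch; the process count and binary name are from my setup):

```shell
# Run every MPI rank under valgrind's memcheck tool. --track-origins tells
# valgrind where the uninitialised bytes were created (large slowdown), and
# --num-callers deepens the recorded stack traces.
mpirun -np 20 \
    valgrind --track-origins=yes --num-callers=20 ./matrix
```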