Hi,
I'm using MKL ScaLAPACK in my project. Recently I tried to run my
application on a new set of nodes. Unfortunately, whenever I start
more than about 20 processes, I get a segmentation fault.
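
For reference, I launch the application more or less like this (the
process count and hostfile name here are only illustrative):

  mpirun --hostfile hosts -np 24 ./matrix

The crash looks like this: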

[compn7:03552] *** Process received signal ***
[compn7:03552] Signal: Segmentation fault (11)
[compn7:03552] Signal code: Address not mapped (1)
[compn7:03552] Failing at address: 0x20b2e68
[compn7:03552] [ 0] /lib64/libpthread.so.0(+0xf3c0) [0x7f46e0fc33c0]
[compn7:03552] [ 1] /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0xd577) [0x7f46dd093577]
[compn7:03552] [ 2] /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x5b4c) [0x7f46dc5edb4c]
[compn7:03552] [ 3] /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(+0x1dbe8) [0x7f46e0679be8]
[compn7:03552] [ 4] /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(opal_progress+0xa1) [0x7f46e066dbf1]
[compn7:03552] [ 5] /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5945) [0x7f46dd08b945]
[compn7:03552] [ 6] /home/gmaj/lib/openmpi/lib/libmpi.so.0(MPI_Send+0x6a) [0x7f46e0b4f10a]
[compn7:03552] [ 7] /home/gmaj/matrix/matrix(BI_Ssend+0x21) [0x49cc11]
[compn7:03552] [ 8] /home/gmaj/matrix/matrix(BI_IdringBR+0x79) [0x49c579]
[compn7:03552] [ 9] /home/gmaj/matrix/matrix(ilp64_Cdgebr2d+0x221) [0x495bb1]
[compn7:03552] [10] /home/gmaj/matrix/matrix(Cdgebr2d+0xd0) [0x47ffb0]
[compn7:03552] [11] /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so(PB_CInV2+0x1304) [0x7f46e27f5124]
[compn7:03552] *** End of error message ***

The error appears during a ScaLAPACK computation; my processes do some
MPI communication of their own before it occurs.

I found that by shrinking the btl_tcp_eager_limit and
btl_tcp_max_send_size parameters I can run more processes: the smaller
those values, the more processes I can start. Unfortunately, even with
this method I've only managed to get up to 30 processes, which is
still far too few.
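
Concretely, I've been lowering them on the mpirun command line, roughly
like this (the values shown are just one of the combinations I tried):

  mpirun --mca btl_tcp_eager_limit 32768 \
         --mca btl_tcp_max_send_size 32768 \
         --hostfile hosts -np 30 ./matrix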

A possible clue is what Valgrind reports:

==3894== Syscall param writev(vector[...]) points to uninitialised byte(s)
==3894==    at 0x82D009B: writev (in /lib64/libc-2.12.90.so)
==3894==    by 0xBA2136D: mca_btl_tcp_frag_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
==3894==    by 0xBA203D0: mca_btl_tcp_endpoint_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
==3894==    by 0xB003583: mca_pml_ob1_send_request_start_rdma (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3894==    by 0xAFFA7C9: mca_pml_ob1_send (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3894==    by 0x6D4B109: PMPI_Send (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x49CC10: BI_Ssend (in /home/gmaj/matrix/matrix)
==3894==    by 0x49C578: BI_IdringBR (in /home/gmaj/matrix/matrix)
==3894==    by 0x495BB0: ilp64_Cdgebr2d (in /home/gmaj/matrix/matrix)
==3894==    by 0x47FFAF: Cdgebr2d (in /home/gmaj/matrix/matrix)
==3894==    by 0x51B38E0: PB_CInV2 (in /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
==3894==    by 0x51DB89B: PB_CpgemmAB (in /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
==3894==  Address 0xadecdce is 461,886 bytes inside a block of size 527,544 alloc'd
==3894==    at 0x4C2615D: malloc (vg_replace_malloc.c:195)
==3894==    by 0x6D0BBA3: ompi_free_list_grow (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0xBA1E1A4: mca_btl_tcp_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
==3894==    by 0x6D5C909: mca_btl_base_select (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0xB40E950: mca_bml_r2_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_bml_r2.so)
==3894==    by 0x6D5C07E: mca_bml_base_init (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0xAFF8A0E: mca_pml_ob1_component_init (in /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3894==    by 0x6D663B2: mca_pml_base_select (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x6D25D20: ompi_mpi_init (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x6D45987: PMPI_Init_thread (in /home/gmaj/lib/openmpi/lib/libmpi.so.0)
==3894==    by 0x42490A: MPI::Init_thread(int&, char**&, int) (functions_inln.h:150)
==3894==    by 0x41F483: main (matrix.cpp:83)

I've tried configuring Open MPI with the --without-memory-manager
option, but it didn't help.
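
For completeness, the configure invocation was along these lines (the
prefix matches the install path visible in the backtraces above):

  ./configure --prefix=/home/gmaj/lib/openmpi --without-memory-manager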

I can successfully run exactly the same application on other machines,
even on more than 800 nodes.

Does anyone have any idea how to further debug this issue? Any help
would be appreciated.

Thanks,
Grzegorz Maj
