Re: [OMPI users] collective communications broken on more than 4 cores
This also appears to fix a bug I had reported that did not involve collective calls. The code is appended. When run on a 64-bit architecture with:

iter.cary$ gcc --version
gcc (GCC) 4.4.0 20090506 (Red Hat 4.4.0-4)
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

iter.cary$ uname -a
Linux iter.txcorp.com 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

iter.cary$ mpicc -show
gcc -I/usr/local/openmpi-1.3.2-nodlopen/include -pthread -L/usr/local/torque-2.4.0b1/lib -Wl,--rpath -Wl,/usr/local/torque-2.4.0b1/lib -Wl,-rpath,/usr/local/openmpi-1.3.2-nodlopen/lib -L/usr/local/openmpi-1.3.2-nodlopen/lib -lmpi -lopen-rte -lopen-pal -ltorque -ldl -lnsl -lutil -lm

as

  mpirun -n 3 ompi1.3.3-bug

it hangs after some 100-500 iterations. When run as

  mpirun -n 3 -mca btl ^sm ./ompi1.3.3-bug

or

  mpirun -n 3 -mca btl_sm_num_fifos 5 ./ompi1.3.3-bug

it seems to work fine.
Valgrind points to some issues:

==29641== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==29641==    at 0x30B5EDAA79: syscall (in /lib64/libc-2.10.1.so)
==29641==    by 0x54B5098: opal_paffinity_linux_plpa_api_probe_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x54B7394: opal_paffinity_linux_plpa_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x54B5D39: opal_paffinity_linux_plpa_have_topology_information (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x54B4F3F: linux_module_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x54B2D03: opal_paffinity_base_select (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x548C3D3: opal_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x520F09C: orte_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x4E67D26: ompi_mpi_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29641==    by 0x4E87195: PMPI_Init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29641==    by 0x408011: main (in /home/research/cary/ompi1.3.3-bug)
==29641== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==29641== Warning: client syscall munmap tried to modify addresses 0x-0xffe
==29640== Warning: client syscall munmap tried to modify addresses 0x-0xffe
==29639== Warning: client syscall munmap tried to modify addresses 0x-0xffe
==29641==
==29641== Syscall param writev(vector[...]) points to uninitialised byte(s)
==29641==    at 0x30B5ED67AB: writev (in /lib64/libc-2.10.1.so)
==29641==    by 0x5241686: mca_oob_tcp_msg_send_handler (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x52426BC: mca_oob_tcp_peer_send (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x52450EC: mca_oob_tcp_send_nb (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x5255B33: orte_rml_oob_send_buffer (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x5230682: allgather (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x5230179: modex (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x4E68199: ompi_mpi_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29641==    by 0x4E87195: PMPI_Init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29641==    by 0x408011: main (in /home/research/cary/ompi1.3.3-bug)
==29641== Address 0x5c89aef is 87 bytes inside a block of size 128 alloc'd
==29641==    at 0x4A0763E: malloc (vg_replace_malloc.c:207)
==29641==    by 0x548D76A: opal_dss_buffer_extend (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x548E780: opal_dss_pack (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-pal.so.0.0.0)
==29641==    by 0x5230620: allgather (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x5230179: modex (in /usr/local/openmpi-1.3.2-nodlopen/lib/libopen-rte.so.0.0.0)
==29641==    by 0x4E68199: ompi_mpi_init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29641==    by 0x4E87195: PMPI_Init (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29641==    by 0x408011: main (in /home/research/cary/ompi1.3.3-bug)
==29640== Conditional jump or move depends on uninitialised value(s)
==29640==    at 0x4EF26A4: mca_mpool_sm_alloc (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29640==    by 0x4E4BEEF: ompi_free_list_grow (in /usr/local/openmpi-1.3.2-nodlopen/lib/libmpi.so.0.0.0)
==29640==    by 0x4EA8793: mca_btl_sm_add_procs (in
Re: [OMPI users] collective communications broken on more than 4 cores
> >>> It seems that the calls to collective communication are not
> >>> returning for some MPI processes, when the number of processes is
> >>> greater than or equal to 5. It's reproducible, on two different
> >>> architectures, with two different versions of Open MPI (1.3.2 and
> >>> 1.3.3). It was working correctly with Open MPI version 1.2.7.
> >>
> >> Does it work if you turn off the shared memory transport layer;
> >> that is,
> >>
> >>   mpirun -n 6 -mca btl ^sm ./testmpi
> >
> > Yes it does, on both my configurations (AMD and Intel processor).
> > So it seems that the shared memory synchronization process is
> > broken.
>
> Presumably that is this bug:
> https://svn.open-mpi.org/trac/ompi/ticket/2043

Yes it is.

> I also found by trial and error that increasing the number of fifos, e.g.
>   -mca btl_sm_num_fifos 5
> on a 6-processor job, apparently worked around the problem.
> But yes, something seems broken in the Open MPI shared memory
> transport with gcc 4.4.x.

Yes, same for me: -mca btl_sm_num_fifos 5 worked.

Thanks for your answer, Jonathan. If I can help the developers track down this bug in any way, please get in contact with me.

--Vincent
Re: [OMPI users] collective communications broken on more than 4 cores
On 2009-10-29, at 10:21AM, Vincent Loechner wrote:

>>> It seems that the calls to collective communication are not
>>> returning for some MPI processes, when the number of processes is
>>> greater than or equal to 5. It's reproducible, on two different
>>> architectures, with two different versions of Open MPI (1.3.2 and
>>> 1.3.3). It was working correctly with Open MPI version 1.2.7.
>>
>> Does it work if you turn off the shared memory transport layer;
>> that is,
>>
>>   mpirun -n 6 -mca btl ^sm ./testmpi
>
> Yes it does, on both my configurations (AMD and Intel processor).
> So it seems that the shared memory synchronization process is
> broken.

Presumably that is this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2043

I also found by trial and error that increasing the number of fifos, e.g.

  -mca btl_sm_num_fifos 5

on a 6-processor job, apparently worked around the problem. But yes, something seems broken in the Open MPI shared memory transport with gcc 4.4.x.

Jonathan
--
Jonathan Dursi
Re: [OMPI users] collective communications broken on more than 4 cores
> > It seems that the calls to collective communication are not
> > returning for some MPI processes, when the number of processes is
> > greater than or equal to 5. It's reproducible, on two different
> > architectures, with two different versions of Open MPI (1.3.2 and
> > 1.3.3). It was working correctly with Open MPI version 1.2.7.
>
> Does it work if you turn off the shared memory transport layer; that is,
>
>   mpirun -n 6 -mca btl ^sm ./testmpi

Yes it does, on both my configurations (AMD and Intel processor). So it seems that the shared memory synchronization process is broken. It could be a system bug; I don't know what mechanism Open MPI uses (is it IPC?). Both my systems are Linux 2.6.31: the AMD machine runs Ubuntu, and the Intel one runs Arch Linux.

--Vincent
Re: [OMPI users] collective communications broken on more than 4 cores
On 2009-10-29, at 9:57AM, Vincent Loechner wrote:

> [...]
> It seems that the calls to collective communication are not
> returning for some MPI processes, when the number of processes is
> greater than or equal to 5. It's reproducible, on two different
> architectures, with two different versions of Open MPI (1.3.2 and
> 1.3.3). It was working correctly with Open MPI version 1.2.7.
> [...]
> GCC version:
> $ mpicc --version
> gcc (Ubuntu 4.4.1-4ubuntu7) 4.4.1

Does it work if you turn off the shared memory transport layer; that is,

  mpirun -n 6 -mca btl ^sm ./testmpi

?

- Jonathan
--
Jonathan Dursi
[OMPI users] collective communications broken on more than 4 cores
Hello to the list,

I came upon a problem running a simple program with collective communications on a 6-core processor (6 local MPI processes). It seems that the calls to collective communication are not returning for some MPI processes, when the number of processes is greater than or equal to 5. It's reproducible, on two different architectures, with two different versions of Open MPI (1.3.2 and 1.3.3). It was working correctly with Open MPI version 1.2.7.

I just wrote a very simple test, making 1000 calls to MPI_Barrier(). Running on an Istanbul processor (6-core AMD Opteron):

$ uname -a
Linux istanbool 2.6.31-14-generic #46-Ubuntu SMP Tue Oct 13 16:47:28 UTC 2009 x86_64 GNU/Linux

with the Open MPI Ubuntu package, version 1.3.2. Running with 5 or 6 MPI processes, it just hangs after a random number of iterations, ranging from 3 to 600, and sometimes it finishes correctly (about 1 time out of 8). I simply ran:

  mpirun -n 6 ./testmpi

Same behavior with more MPI processes. I tried the '--mca coll_basic_priority 50' option; the program then has a better chance of finishing (about one time out of 2), but it also deadlocks the other times, after a random number of iterations.

Without setting the coll_basic_priority option, I ran a debugger and found that some processes are blocked in:

#0  0x7f858f272f7a in opal_progress () from /usr/lib/libopen-pal.so.0
#1  0x7f858f7524f5 in ?? () from /usr/lib/libmpi.so.0
#2  0x7f8589e74c5a in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3  0x7f8589e7cefa in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x7f858f767b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#5  0x00400c10 in main (argc=1, argv=0x7fff9d59acf8) at testmpi.c:24

and the others in:

#0  0x7f05799e933a in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1  0x7f057dd22fba in opal_progress () from /usr/lib/libopen-pal.so.0
#2  0x7f057e2024f5 in ?? () from /usr/lib/libmpi.so.0
#3  0x7f0578924c5a in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x7f057892cefa in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#5  0x7f057e217b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#6  0x00400c10 in main (argc=1, argv=0x7fff1b55b4a8) at testmpi.c:24

It seems that other collective communications are broken too; my original program was blocked after a call to MPI_Allreduce. I also made tests on a 4-core Intel Core i7, Open MPI version 1.3.3, with exactly the same problem: calls to collective communication not returning for some MPI processes when the number of processes is greater than or equal to 5.

Below are some technical details on my configuration, the input file, and example outputs. The output of ompi_info --all is attached to this mail.

Best regards,

--
Vincent LOECHNER
ICPS, LSIIT (UMR 7005), PhD
Equipe INRIA CAMUS, Université de Strasbourg
Phone: +33 (0)368 85 45 37   Fax: +33 (0)368 85 45 47
Pôle API, Bd. Sébastien Brant, F-67412 ILLKIRCH Cedex
loech...@unistra.fr   http://icps.u-strasbg.fr

--
Input program:

// testmpi.c ---
#include <stdio.h>
#include <mpi.h>

#define MCW MPI_COMM_WORLD

int main( int argc, char **argv )
{
    int n, r;   /* number of processes, process rank */
    int i;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MCW, &n );
    MPI_Comm_rank( MCW, &r );
    for( i=0 ; i<1000 ; i++ ) {
        printf( "(%d) %d\n", r, i );
        fflush(stdout);
        MPI_Barrier( MCW );
    }
    MPI_Finalize();
    return( 0 );
}
// testmpi.c ---

Compilation line:
$ mpicc -O2 -Wall -g testmpi.c -o testmpi

GCC version:
$ mpicc --version
gcc (Ubuntu 4.4.1-4ubuntu7) 4.4.1

OpenMPI version: 1.3.2
$ ompi_info -v ompi full
                 Package: Open MPI buildd@crested Distribution
                Open MPI: 1.3.2
   Open MPI SVN revision: r21054
   Open MPI release date: Apr 21, 2009
                Open RTE: 1.3.2
   Open RTE SVN revision: r21054
   Open RTE release date: Apr 21, 2009
                    OPAL: 1.3.2
       OPAL SVN revision: r21054
       OPAL release date: Apr 21, 2009
            Ident string: 1.3.2

---
Example run (I hit ^C after a while):
$ mpirun -n 6 ./testmpi
(0) 0
(0) 1
(0) 2
(0) 3
(1) 0
(1) 1
(1) 2
(2) 0
(2) 1
(2) 2
(2) 3
(3) 0
(3) 1
(3) 2
(4) 0
(4) 1
(4) 2
(4) 3
(5) 0
(5) 1
(5) 2
(5) 3
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10466 on node istanbool exited on signal 0 (Unknown signal 0).
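The report above mentions that the original program also hung inside MPI_Allreduce, not just MPI_Barrier. A minimal sketch of a stress test for that case, modeled on the reporter's testmpi.c, is below. This is not the reporter's original program; the file name, loop count, and variable names are my assumptions. Any standard MPI implementation should provide the calls used here.

```c
/* allreduce-test.c --- hypothetical MPI_Allreduce variant of testmpi.c.
 * On an affected Open MPI 1.3.2/1.3.3 build with the sm BTL enabled,
 * one would expect this to hang at MPI_Allreduce for >= 5 processes,
 * just as the barrier loop does. */
#include <stdio.h>
#include <mpi.h>

int main( int argc, char **argv )
{
    int n, r;       /* number of processes, process rank */
    int i, local, sum;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &n );
    MPI_Comm_rank( MPI_COMM_WORLD, &r );

    for( i = 0; i < 1000; i++ ) {
        local = r + i;
        /* all ranks must reach this call for it to complete;
         * a stuck shared-memory FIFO would block it */
        MPI_Allreduce( &local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
        if( r == 0 ) {
            printf( "iter %d: sum=%d\n", i, sum );
            fflush( stdout );
        }
    }
    MPI_Finalize();
    return 0;
}
```

Compiling with `mpicc -O2 -Wall allreduce-test.c -o allreduce-test` and running it with and without `-mca btl ^sm` (or with `-mca btl_sm_num_fifos 5`) should show whether the hang follows the shared-memory transport, as it did for the barrier test.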