Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110
Tony --

My apologies for taking so long to answer. :-( I was unfortunately unable to replicate your problem. I ran your source code across 32 machines connected by TCP with no problem:

    mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8

I tried this on two different clusters with the same results -- it didn't hang. :-( Can you try again with a recent nightly tarball, or the 1.1.1 beta tarball that has been posted?

http://www.open-mpi.org/software/ompi/v1.1/

On 6/30/06 8:35 AM, "Tony Ladd" wrote:

> Jeff
>
> Thanks for the reply; I realize you guys must be really busy with the recent
> release of Open MPI. I tried 1.1 and I don't get error messages any more. But
> the code now hangs; no error or exit. So I am not sure if this is the same
> issue or something else. I am enclosing my source code. I compiled with icc
> and linked against an icc-compiled version of openmpi-1.1.
>
> My program is a set of network benchmarks (a crude kind of NetPIPE) that
> checks typical message-passing patterns in my application codes.
> Typical output is:
>
> 32 CPU's: sync call time = 1003.0
>
>                         time                                    rate (Mbytes/s)                         bandwidth (MBits/s)
> loop buffers  size      XC       XE       GS       MS           XC       XE       GS       MS           XC       XE       GS       MS
>    1      64 16384  2.48e-02 1.99e-02 1.21e+00 3.88e-02     4.23e+01 5.28e+01 8.65e-01 2.70e+01     1.08e+04 1.35e+04 4.43e+02 1.38e+04
>    2      64 16384  2.17e-02 2.09e-02 1.21e+00 4.10e-02     4.82e+01 5.02e+01 8.65e-01 2.56e+01     1.23e+04 1.29e+04 4.43e+02 1.31e+04
>    3      64 16384  2.20e-02 1.99e-02 1.01e+00 3.95e-02     4.77e+01 5.27e+01 1.04e+00 2.65e+01     1.22e+04 1.35e+04 5.33e+02 1.36e+04
>    4      64 16384  2.16e-02 1.96e-02 1.25e+00 4.00e-02     4.85e+01 5.36e+01 8.37e-01 2.62e+01     1.24e+04 1.37e+04 4.28e+02 1.34e+04
>    5      64 16384  2.25e-02 2.00e-02 1.25e+00 4.07e-02     4.66e+01 5.24e+01 8.39e-01 2.57e+01     1.19e+04 1.34e+04 4.30e+02 1.32e+04
>    6      64 16384  2.19e-02 1.99e-02 1.29e+00 4.05e-02     4.79e+01 5.28e+01 8.14e-01 2.59e+01     1.23e+04 1.35e+04 4.17e+02 1.33e+04
>    7      64 16384  2.19e-02 2.06e-02 1.25e+00 4.03e-02     4.79e+01 5.09e+01 8.38e-01 2.60e+01     1.23e+04 1.30e+04 4.29e+02 1.33e+04
>    8      64 16384  2.24e-02 2.06e-02 1.25e+00 4.01e-02     4.69e+01 5.09e+01 8.39e-01 2.62e+01     1.20e+04 1.30e+04 4.30e+02 1.34e+04
>    9      64 16384  4.29e-01 2.01e-02 6.35e-01 3.98e-02     2.45e+00 5.22e+01 1.65e+00 2.64e+01     6.26e+02 1.34e+04 8.46e+02 1.35e+04
>   10      64 16384  2.16e-02 2.06e-02 8.87e-01 4.00e-02     4.85e+01 5.09e+01 1.18e+00 2.62e+01     1.24e+04 1.30e+04 6.05e+02 1.34e+04
>
> Time is total for all 64 buffers. Rate is one way across one link (# of
> bytes/time).
>
> 1) XC is a bidirectional ring exchange. Each processor sends to the right
> and receives from the left.
> 2) XE is an edge exchange. Pairs of nodes exchange data, with each one
> sending and receiving.
> 3) GS is MPI_Allreduce.
> 4) MS is my version of MPI_Allreduce. It splits the vector into Np blocks
> (Np is the # of processors); each processor then acts as a head node for one
> block. This uses the full bandwidth all the time, unlike Allreduce, which
> thins out as it gets to the top of the binary tree.
> On a 64 node Infiniband
> system MS is about 5X faster than GS; in theory it would be 6X, i.e. log_2(64).
> Here it is 25X; not sure why so much. But MS seems to be the cause of the
> hangups with messages > 64K. I can run the other benchmarks OK, but this one
> seems to hang for large messages. I think the problem is at least partly due
> to the switch. All MS is doing is point-to-point communications, but
> unfortunately it sometimes requires high bandwidth between ASICs. First it
> exchanges data between near neighbors in MPI_COMM_WORLD, but it
> must progressively span wider gaps between nodes as it goes up the various
> binary trees. After a while this requires extensive traffic between ASICs.
> This seems to be a problem on both my HP 2724 and the Extreme Networks
> Summit400t-48. I am currently working with Extreme to try to resolve the
> switch issue. As I say, the code ran great on Infiniband, but I think those
> switches have hardware flow control. Finally, I checked the code again under
> LAM and it ran OK. Slow, but no hangs.
>
> To run the code, compile and type:
>
>   mpirun -np 32 -machinefile hosts src/netbench 8
>
> The 8 means 2^8 Kbytes (i.e. 256K). This was enough to hang every time on my
> boxes.
>
> You can also edit the header file (header.h). MAX_LOOPS is how many times it
> runs each test (currently 10); NUM_BUF is the number of buffers in each test
> (must be more than the number of processors); SYNC defines the global sync
> frequency: every SYNC buffers. NUM_SYNC is the number of sequential barrier
> calls it uses to determine the mean barrier call time. You can also
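[Editor's note: the "MS" scheme described above -- each rank acting as head node for one block of the vector -- is essentially a reduce-scatter followed by an allgather. The sketch below is a hypothetical single-process Python simulation of that data movement, not Tony's actual C/MPI code; all names in it are invented for illustration.]

```python
# Simulation of the "MS" allreduce pattern: split the vector into Np
# blocks, let rank p sum block p across all ranks (reduce-scatter),
# then redistribute the reduced blocks to everyone (allgather).
def ms_allreduce(vectors):
    """All-reduce (sum) over a list of equal-length vectors, one per rank."""
    np_ranks = len(vectors)
    n = len(vectors[0])
    assert n % np_ranks == 0, "vector length must divide evenly into Np blocks"
    block = n // np_ranks

    # Reduce-scatter: rank p sums block p across all ranks, so every
    # link carries traffic at once instead of thinning toward a tree root.
    reduced = []
    for p in range(np_ranks):
        lo, hi = p * block, (p + 1) * block
        reduced.append([sum(v[i] for v in vectors) for i in range(lo, hi)])

    # Allgather: every rank receives every reduced block.
    result = [x for blk in reduced for x in blk]
    return [list(result) for _ in range(np_ranks)]

ranks = [[r + i for i in range(8)] for r in range(4)]  # 4 ranks, 8 elements each
out = ms_allreduce(ranks)
naive = [sum(v[i] for v in ranks) for i in range(8)]
print(out[0] == naive)  # -> True: every rank holds the full reduced vector
```

In real MPI terms this corresponds to MPI_Reduce_scatter followed by MPI_Allgather; the bandwidth advantage over a binary-tree allreduce comes from all Np links staying busy in every phase.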
Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110
Sorry for the delay in replying -- sometimes we just get overwhelmed with all the incoming mail. :-(

> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Tony Ladd
> Sent: Saturday, June 17, 2006 9:47 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] mca_btl_tcp_frag_send: writev failed
> with errno=110
>
> I am getting the following error with openmpi-1.1b1
>
> mca_btl_tcp_frag_send: writev failed with errno=110

Can you try this with the final released version of 1.1, just to see if the problem still exists? 110 = ETIMEDOUT, which seems like a strange error to get here, because the TCP connection should have already been made.

> 1) This does not ever happen with other MPI's I have tried,
> like MPICH and LAM.
> 2) It only seems to happen with large numbers of cpus, 32 and
> occasionally 16, and with larger message sizes. In this case it was 128K.
> 3) It only seems to happen with dual cpus on each node.
> 4) My configuration is default with (in openmpi-mca-params.conf):
>
>   pls_rsh_agent = rsh
>   btl = tcp,self
>   btl_tcp_if_include = eth1
>
> I also set --mca btl_tcp_eager_limit 131072 when running the
> program, though leaving this out does not eliminate the problem.
>
> My program is a communication test; it sends bidirectional
> point-to-point messages among N cpus. In one test it exchanges
> messages between pairs of cpus, in another it reads from the node on
> its left and sends to the node on its right (a kind of ring), and in
> a third it uses MPI_Allreduce.

Can you share your code and give a recipe for replicating the problem?

> Finally: the tcp driver in openmpi seems not nearly as good
> as the one in LAM. I got higher throughput with far fewer dropouts with LAM.

This is unfortunately a known issue. The reason for it is that all the current Open MPI members concentrate mainly on high-speed networks such as InfiniBand, shared memory, and Myrinet.
TCP *works*, and so far that has been "good enough," but we're all aware that it still needs to be optimized. The issue is actually not the protocols that we're using over TCP. We're pretty sure that it has to do with how Open MPI's file descriptor progression engine works (disclaimer: we haven't spent a lot of time trying to characterize this since we've been focusing on the high-speed networks, but we're pretty sure that this is the big issue).

Internally, we use the software package "libevent" as an engine for fd and signal progress, but there are some cases that seem to be somewhat inefficient. We use this progression engine (as opposed to, say, a dedicated socket state machine in the TCP BTL itself) because we need to make progress on both the MPI TCP communications and the underlying run-time environment (ORTE) TCP communications. Hence, we needed a central "engine" that can handle both.

This is an area where we would love to get some outside help -- it's not so much a network issue, but more likely a systems issue. None of us currently have engineering resources to spend time on this; is there anyone out there in the open source community who could help? If so, we can provide more details on where we think the bottlenecks are, etc.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
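[Editor's note: the "central progression engine" idea -- one event loop multiplexing file descriptors for both MPI traffic and run-time (ORTE) traffic -- can be illustrated with a toy loop. This is a generic selectors-based sketch, not Open MPI's actual libevent code; the pipes and names are stand-ins.]

```python
# Toy fd progression engine: a single selectors loop services two
# unrelated pipes, standing in for the MPI (BTL) sockets and the
# run-time (ORTE) sockets that Open MPI drives through libevent.
import os
import selectors

sel = selectors.DefaultSelector()
mpi_r, mpi_w = os.pipe()      # stand-in for a BTL TCP socket
orte_r, orte_w = os.pipe()    # stand-in for an ORTE out-of-band socket

received = {}

def on_readable(tag):
    """Return a callback that drains the fd and records its payload."""
    def handler(fd):
        received[tag] = os.read(fd, 4096)
        sel.unregister(fd)
    return handler

sel.register(mpi_r, selectors.EVENT_READ, on_readable("mpi"))
sel.register(orte_r, selectors.EVENT_READ, on_readable("orte"))

os.write(mpi_w, b"payload")
os.write(orte_w, b"oob-msg")

# Progress loop: one engine makes progress on both kinds of traffic,
# which is why a dedicated per-BTL socket state machine wasn't used.
while len(received) < 2:
    for key, _ in sel.select(timeout=1):
        key.data(key.fileobj)

print(received["mpi"], received["orte"])
```

The cost Jeff alludes to is visible even here: every byte of MPI traffic pays for a trip through the generic dispatch machinery, rather than a tight loop tuned to one socket.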
[OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110
I am getting the following error with openmpi-1.1b1

mca_btl_tcp_frag_send: writev failed with errno=110

1) This does not ever happen with other MPI's I have tried, like MPICH and LAM.
2) It only seems to happen with large numbers of cpus, 32 and occasionally 16, and with larger message sizes. In this case it was 128K.
3) It only seems to happen with dual cpus on each node.
4) My configuration is default with (in openmpi-mca-params.conf):

   pls_rsh_agent = rsh
   btl = tcp,self
   btl_tcp_if_include = eth1

I also set --mca btl_tcp_eager_limit 131072 when running the program, though leaving this out does not eliminate the problem.

My program is a communication test; it sends bidirectional point-to-point messages among N cpus. In one test it exchanges messages between pairs of cpus, in another it reads from the node on its left and sends to the node on its right (a kind of ring), and in a third it uses MPI_Allreduce.

Finally: the tcp driver in openmpi seems not nearly as good as the one in LAM. I got higher throughput with far fewer dropouts with LAM.

Tony

---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005
Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu
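[Editor's note: as Jeff points out upthread, errno 110 is ETIMEDOUT on Linux. A quick way to decode such a value (this assumes the Linux errno numbering; the constants differ on other platforms):]

```python
# Decode the errno from the failure message above. On Linux, 110 is
# ETIMEDOUT; errno numbers are platform-specific, so this check
# assumes a Linux system.
import errno
import os

code = 110  # from "writev failed with errno=110"
name = errno.errorcode[code]
print(name, "-", os.strerror(code))
```

A timeout on an already-established connection usually points at dropped packets and exhausted TCP retransmissions, which is consistent with the switch-congestion theory discussed elsewhere in this thread.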