Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-07-28 Thread Jeff Squyres
Tony --

My apologies for taking so long to answer.  :-(

I was unfortunately unable to replicate your problem.  I ran your source
code across 32 machines connected by TCP with no problem:

  mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8

I tried this on two different clusters with the same results -- it didn't
hang.  :-(

Can you try again with a recent nightly tarball, or the 1.1.1 beta tarball
that has been posted?

  http://www.open-mpi.org/software/ompi/v1.1/


On 6/30/06 8:35 AM, "Tony Ladd"  wrote:

> Jeff
> 
> Thanks for the reply; I realize you guys must be really busy with the recent
> release of Open MPI. I tried 1.1 and I don't get error messages any more. But
> the code now hangs; no error or exit. So I am not sure if this is the same
> issue or something else. I am enclosing my source code. I compiled with icc
> and linked against an icc-compiled version of openmpi-1.1.
> 
> My program is a set of network benchmarks (a crude kind of netpipe) that
> checks typical message passing patterns in my application codes.
> Typical output is:
> 
>  32 CPU's: sync call time = 1003.0
>
>                                        time                             rate (Mbytes/s)                       bandwidth (MBits/s)
> loop buffers   size      XC       XE       GS       MS          XC       XE       GS       MS          XC       XE       GS       MS
>    1     64   16384   2.48e-02 1.99e-02 1.21e+00 3.88e-02    4.23e+01 5.28e+01 8.65e-01 2.70e+01    1.08e+04 1.35e+04 4.43e+02 1.38e+04
>    2     64   16384   2.17e-02 2.09e-02 1.21e+00 4.10e-02    4.82e+01 5.02e+01 8.65e-01 2.56e+01    1.23e+04 1.29e+04 4.43e+02 1.31e+04
>    3     64   16384   2.20e-02 1.99e-02 1.01e+00 3.95e-02    4.77e+01 5.27e+01 1.04e+00 2.65e+01    1.22e+04 1.35e+04 5.33e+02 1.36e+04
>    4     64   16384   2.16e-02 1.96e-02 1.25e+00 4.00e-02    4.85e+01 5.36e+01 8.37e-01 2.62e+01    1.24e+04 1.37e+04 4.28e+02 1.34e+04
>    5     64   16384   2.25e-02 2.00e-02 1.25e+00 4.07e-02    4.66e+01 5.24e+01 8.39e-01 2.57e+01    1.19e+04 1.34e+04 4.30e+02 1.32e+04
>    6     64   16384   2.19e-02 1.99e-02 1.29e+00 4.05e-02    4.79e+01 5.28e+01 8.14e-01 2.59e+01    1.23e+04 1.35e+04 4.17e+02 1.33e+04
>    7     64   16384   2.19e-02 2.06e-02 1.25e+00 4.03e-02    4.79e+01 5.09e+01 8.38e-01 2.60e+01    1.23e+04 1.30e+04 4.29e+02 1.33e+04
>    8     64   16384   2.24e-02 2.06e-02 1.25e+00 4.01e-02    4.69e+01 5.09e+01 8.39e-01 2.62e+01    1.20e+04 1.30e+04 4.30e+02 1.34e+04
>    9     64   16384   4.29e-01 2.01e-02 6.35e-01 3.98e-02    2.45e+00 5.22e+01 1.65e+00 2.64e+01    6.26e+02 1.34e+04 8.46e+02 1.35e+04
>   10     64   16384   2.16e-02 2.06e-02 8.87e-01 4.00e-02    4.85e+01 5.09e+01 1.18e+00 2.62e+01    1.24e+04 1.30e+04 6.05e+02 1.34e+04
> 
> Time is total for all 64 buffers. Rate is one way across one link (# of
> bytes/time).
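
(For reference, these columns appear to be related by rate = buffers x size /
time per link, with the bandwidth column aggregated over all 32 links; that
reading is inferred from the numbers rather than stated in the message, but
the first XC entry is consistent with it:

   rate      = 64 * 16384 bytes / 2.48e-02 s    ~ 4.23e+01 Mbytes/s
   bandwidth = 4.23e+01 Mbytes/s * 8 bits * 32  ~ 1.08e+04 Mbits/s)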
> 1) XC is a bidirectional ring exchange. Each processor sends to the right
> and receives from the left.
> 2) XE is an edge exchange. Pairs of nodes exchange data, with each one
> sending and receiving.
> 3) GS is MPI_Allreduce.
> 4) MS is my version of MPI_Allreduce. It splits the vector into Np blocks
> (Np is the number of processors); each processor then acts as a head node
> for one block (a rough sketch of this idea appears after this message). This
> uses the full bandwidth all the time, unlike Allreduce, which thins out as
> it gets to the top of the binary tree. On a 64-node InfiniBand system MS is
> about 5X faster than GS; in theory it would be 6X, i.e. log_2(64). Here it
> is 25X, and I am not sure why it is so much. But MS seems to be the cause of
> the hangups with messages > 64K. I can run the other benchmarks OK, but this
> one seems to hang for large messages. I think the problem is at least partly
> due to the switch. All MS is doing is point-to-point communication, but
> unfortunately it sometimes requires high bandwidth between ASICs. At first
> it exchanges data between near neighbors in MPI_COMM_WORLD, but it must
> progressively span wider gaps between nodes as it goes up the various binary
> trees. After a while this requires extensive traffic between ASICs. This
> seems to be a problem on both my HP 2724 and the Extreme Networks
> Summit400t-48. I am currently working with Extreme to try to resolve the
> switch issue. As I say, the code ran great on InfiniBand, but I think those
> switches have hardware flow control. Finally, I checked the code again under
> LAM and it ran OK. Slow, but no hangs.
> 
> To run the code, compile it and type:
> mpirun -np 32 -machinefile hosts src/netbench 8
> The 8 means 2^8 Kbytes (i.e. 256K). This was enough to hang every time on my
> boxes.
> 
> You can also edit the header file (header.h). MAX_LOOPS is how many times it
> runs each test (currently 10); NUM_BUF is the number of buffers in each test
> (must be more than the number of processors); SYNC defines the global sync
> frequency (every SYNC buffers); NUM_SYNC is the number of sequential barrier
> calls it uses to determine the mean barrier call time. You can also 
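
The blockwise scheme that MS implements (each rank owns the reduction of one
block, then the reduced blocks are redistributed to everyone) corresponds
roughly to a reduce-scatter followed by an allgather. The following is only a
sketch of that idea in MPI C, assuming a sum over doubles and a vector length
divisible by the number of ranks; it is not the netbench source attached to
the message above:

  /* Sketch of an "MS"-style allreduce: reduce-scatter so that each rank
   * owns the fully reduced copy of one block, then allgather the blocks.
   * Illustration of the idea described above, not the attached code. */
  #include <mpi.h>
  #include <stdlib.h>

  /* Sum-reduce 'n' doubles in place across all ranks of 'comm'.
   * Assumes n is divisible by the number of ranks, for simplicity. */
  void ms_allreduce(double *vec, int n, MPI_Comm comm)
  {
      int np;
      MPI_Comm_size(comm, &np);

      int block = n / np;                    /* size of each rank's block */
      int *counts = malloc(np * sizeof(int));
      for (int i = 0; i < np; i++)
          counts[i] = block;

      double *mine = malloc(block * sizeof(double));

      /* Each rank ends up with the reduced values for its own block. */
      MPI_Reduce_scatter(vec, mine, counts, MPI_DOUBLE, MPI_SUM, comm);

      /* Every rank gathers all reduced blocks back into the full vector. */
      MPI_Allgather(mine, block, MPI_DOUBLE, vec, block, MPI_DOUBLE, comm);

      free(mine);
      free(counts);
  }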

Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-06-29 Thread Jeff Squyres (jsquyres)
Sorry for the delay in replying -- sometimes we just get overwhelmed
with all the incoming mail.  :-(

> -Original Message-
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Tony Ladd
> Sent: Saturday, June 17, 2006 9:47 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] mca_btl_tcp_frag_send: writev failed 
> with errno=110
> 
> I am getting the following error with openmpi-1.1b1
> 
> mca_btl_tcp_frag_send: writev failed with errno=110

Can you try this with the final released version of 1.1, just to see if
the problem still exists?

110 = ETIMEDOUT, which seems like a strange error to get here, because
the TCP connection should have already been made.
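
For anyone who wants to confirm that mapping on their own box, a quick
standalone check (not part of Open MPI) is:

  /* Print the symbolic meaning of errno 110 on this platform.
     On Linux this prints "Connection timed out" (ETIMEDOUT). */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      printf("errno 110: %s\n", strerror(110));
      return 0;
  }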

> 1) This does not ever happen with other MPI's I have tried, like MPICH and
> LAM.
> 2) It only seems to happen with large numbers of cpus, 32 and occasionally
> 16, and with larger message sizes. In this case it was 128K.
> 3) It only seems to happen with dual cpus on each node.
> 4) My configuration is default with (in openmpi-mca-params.conf): 
> pls_rsh_agent = rsh 
> btl = tcp,self 
> btl_tcp_if_include = eth1 
> I also set --mca btl_tcp_eager_limit 131072 when running the program,
> though leaving this out does not eliminate the problem.
> 
> My program is a communication test; it sends bidirectional point-to-point
> messages among N cpus. In one test it exchanges messages between pairs of
> cpus, in another it reads from the node on its left and sends to the node on
> its right (a kind of ring), and in a third it uses MPI_Allreduce.

Can you share your code and give a recipe for replicating the problem?

> Finally: the tcp driver in openmpi seems not nearly as good as the one in
> LAM. I got higher throughput with far fewer dropouts with LAM.

This is unfortunately a known issue.  The reason for it is that all the
current Open MPI members concentrate mainly on high-speed networks such
as InfiniBand, shared memory, and Myrinet.  TCP *works*, and so far that
has been "good enough," but we're all aware that it still needs to be
optimized.

The issue is actually not the protocols that we're using over TCP.
We're pretty sure that it has to do with how Open MPI's file descriptor
progression engine works (disclaimer: we haven't spent a lot of time
trying to characterize this since we've been focusing on the high-speed
networks, but we're pretty sure that this is the big issue).

Internally, we use the software package "libevent" as an engine for fd
and signal progress, but there are some cases that seem to be somewhat
inefficient.  We use this progression engine (as opposed to, say, a
dedicated socket state machine in the TCP BTL itself) because we need to
make progress on both the MPI TCP communications and the underlying
run-time environment (ORTE) TCP communications.  Hence, we needed a
central "engine" that can handle both.

This is an area where we would love to get some outside help -- it's not
so much a network issue as a systems issue.  None of us currently has
engineering resources to spend time on this; is there anyone out there in
the open source community who could help?  If so, we can provide more
details on where we think the bottlenecks are, etc.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



[OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-06-17 Thread Tony Ladd
I am getting the following error with openmpi-1.1b1

mca_btl_tcp_frag_send: writev failed with errno=110

1) This does not ever happen with other MPI's I have tried like MPICH and
LAM
2) It only seems to happen with large numbers of cpus, 32 and occasionally
16, and with larger message sizes. In this case it was 128K.
3) It only seems to happen with dual cpus on each node.
4) My configuration is default with (in openmpi-mca-params.conf): 
pls_rsh_agent = rsh 
btl = tcp,self 
btl_tcp_if_include = eth1 
I also set --mca btl_tcp_eager_limit 131072 when running the program, though
leaving this out does not eliminate the problem.

My program is a communication test; it sends bidirectional point-to-point
messages among N cpus. In one test it exchanges messages between pairs of
cpus, in another it reads from the node on its left and sends to the node on
its right (a kind of ring), and in a third it uses MPI_Allreduce.
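
For illustration, the ring pattern above boils down to something like the
following MPI_Sendrecv step (a sketch only, not the actual netbench source):

  /* Ring exchange: every rank sends a buffer to its right-hand neighbour
   * and receives one from its left-hand neighbour. */
  #include <mpi.h>

  void ring_exchange(char *sendbuf, char *recvbuf, int nbytes, MPI_Comm comm)
  {
      int rank, np;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &np);

      int right = (rank + 1) % np;       /* destination */
      int left  = (rank - 1 + np) % np;  /* source      */

      /* MPI_Sendrecv avoids the deadlock a naive blocking send/recv
       * ordering could produce on a ring. */
      MPI_Sendrecv(sendbuf, nbytes, MPI_BYTE, right, 0,
                   recvbuf, nbytes, MPI_BYTE, left,  0,
                   comm, MPI_STATUS_IGNORE);
  }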

Finally: the tcp driver in openmpi seems not nearly as good as the one in
LAM. I got higher throughput with far fewer dropouts with LAM.

Tony


---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu