On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:

> Hi everyone,
> 
> We have a problem with our Cassandra cluster, and that is that sometimes it 
> takes several seconds to open a new Thrift connection to the server. We've 
> had this issue when we ran on windows, and we have this issue now that we run 
> on Ubuntu. We've had it with our old networking setup, and we have it with 
> our new networking setup where we're running it over a dedicated gigabit 
> network. Normally establishing a new connection is instant, but once in a 
> while it seems like it's not accepting any new connections until three 
> seconds have passed.
> 
> We're of course running a connection-pooling client which mitigates this, 
> since once a connection is established, it's rock solid.
> 
> We tried switching the rpc_server_type to hsha, but that seems to have made 
> the problem worse; we're seeing more connection timeouts because of this.
> 
> For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu, and our 
> connection pool is configured to abort a connection attempt after two 
> seconds, and each connection lives for six hours and then it's recycled. 
> Under current load we do about 500 writes/s and 100 reads/s, we have 20 
> clients, but each has a very small connection pool of maybe up to 5 
> simultaneous connections against each Cassandra server. We see these 
> connection issues maybe once a day, but always at random intervals.
> 
> We've tried to get more information through Datastax Opscenter, the JMX 
> console, and our own application monitoring and logging, but we can't see 
> anything out of the ordinary. Sometimes, seemingly at random, it's just 
> really slow to connect. We're all out of ideas. Does anyone here have 
> suggestions on where to look and what to do next?

Have you ruled out potential non-Cassandra causes?

A consistent 3 seconds sounds like it could be a timeout/retry somewhere.  Do 
you contact Cassandra via a hostname or an IP address?  If via hostname, rule 
out DNS first.

Either way, I'd fire up tcpdump on both the client and the server and observe 
the TCP handshake.  Specifically, see whether the SYN packet is sent and 
received, whether the SYN-ACK is sent back right away and received, and 
whether the final ACK follows.

If that looks good, then TCP-wise you're in good shape and the problem is in a 
higher layer (Thrift).  If not, see where the delay/drop/retry happens.  If 
it's in the first packet (the SYN), it may be a networking/routing issue.  If 
it's in the second (the SYN-ACK), it may be capacity at the server 
(investigate with lsof/netstat/JMX), etc.
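
As a complement to the packet capture, here is a rough Python sketch of a 
client-side probe that times the DNS lookup and the TCP handshake separately, 
so a slow connect can be attributed to one or the other.  The function name 
`probe` and the 5-second timeout are my own choices for illustration, not 
anything from your setup, and it only tells you that a connect was slow, not 
which packet was delayed; tcpdump is still needed for that.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return (dns_seconds, connect_seconds) for one connection attempt.

    Hypothetical helper: resolves the name first, then connects to the
    resolved address, so the two phases are timed independently.
    """
    t0 = time.monotonic()
    # Name resolution only; for a literal IP address this is near-instant.
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.monotonic()

    # connect() returns once the three-way handshake has completed.
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        s.connect(addr)
    t2 = time.monotonic()

    return t1 - t0, t2 - t1
```

Run in a loop against each node (9160 is the default Thrift rpc_port, assuming 
you haven't changed it) and log any attempt where either number exceeds, say, 
one second, along with a timestamp you can match against the tcpdump output.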

