Hi Mina,

The delay is not constant. In the vast majority of cases connecting is almost instant, but occasionally connecting to a server takes a few seconds.
We can't even reproduce it reliably. We can see in our server logs that sometimes, maybe a few times a day, maybe once every few days, a Cassandra server will be slow in accepting connections, and after a little while everything will be OK again. It's not network saturation, it's not CPU saturation, and it's not even GC pauses.

Has anyone else noticed something similar? Or is this simply a result of us running a tight connection pool which recycles connections every few hours and only waits a few seconds for a connection before timing out?

/Henrik

On Thu, Jun 14, 2012 at 4:54 PM, Mina Naguib <[email protected]> wrote:

> On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:
>
> > Hi everyone,
> >
> > We have a problem with our Cassandra cluster: sometimes it takes
> > several seconds to open a new Thrift connection to the server. We had
> > this issue when we ran on Windows, and we have it now that we run on
> > Ubuntu. We had it with our old networking setup, and we have it with
> > our new networking setup, where we're running over a dedicated gigabit
> > network. Normally establishing a new connection is instant, but once
> > in a while it seems like the server is not accepting any new
> > connections until three seconds have passed.
> >
> > We're of course running a connection-pooling client which mitigates
> > this, since once a connection is established, it's rock solid.
> >
> > We tried switching the rpc_server_type to hsha, but that seems to have
> > made the problem worse; we're seeing more connection timeouts because
> > of this.
> >
> > For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu.
> > Our connection pool is configured to abort a connection attempt after
> > two seconds, and each connection lives for six hours and is then
> > recycled. Under current load we do about 500 writes/s and 100 reads/s.
> > We have 20 clients, but each has a very small connection pool of maybe
> > up to 5 simultaneous connections against each Cassandra server. We see
> > these connection issues maybe once a day, but always at random
> > intervals.
> >
> > We've tried to get more information through DataStax OpsCenter, the
> > JMX console, and our own application monitoring and logging, but we
> > can't see anything out of the ordinary. Sometimes, seemingly at
> > random, it's just really slow to connect. We're all out of ideas. Does
> > anyone here have suggestions on where to look and what to do next?
>
> Have you ironed out non-Cassandra potential causes?
>
> A constant 3 seconds sounds like it could be a timeout/retry somewhere.
> Do you contact Cassandra via a hostname or an IP address? If via
> hostname, iron out DNS.
>
> Either way, I'd fire up tcpdump on both the client and the server, and
> observe the TCP handshake. Specifically, see if the SYN packet is sent
> and received, whether the SYN-ACK is sent back right away and received,
> and likewise for the final ACK.
>
> If that looks good, then TCP-wise you're in good shape and the problem
> is in a higher layer (Thrift). If not, see where the delay/drop/retry
> happens. If it's in the first packet, it may be a networking/routing
> issue. If in the second, it may be capacity at the server (investigate
> with lsof/netstat/JMX), etc.
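As a complement to the tcpdump suggestion above, a small client-side probe can time only the TCP handshake, with no Thrift traffic involved. The following is a minimal sketch, not anything taken from the thread: the function names `time_connect` and `find_slow_connects`, the host name, and the thresholds are hypothetical, and port 9160 is assumed because it is Cassandra's default Thrift `rpc_port`.

```python
import socket
import time


def time_connect(host, port, timeout=5.0):
    """Return the seconds taken to complete a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection returns once the handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


def find_slow_connects(host, port, attempts=10, threshold=1.0, pause=0.0):
    """Probe repeatedly; return (attempt_index, elapsed) for slow or failed attempts."""
    slow = []
    for i in range(attempts):
        elapsed = time_connect(host, port)
        if elapsed is None or elapsed >= threshold:
            slow.append((i, elapsed))
        time.sleep(pause)
    return slow


if __name__ == "__main__":
    # Hypothetical host name; 9160 is assumed to be the Thrift port in use.
    for index, elapsed in find_slow_connects("cassandra-host.example", 9160,
                                             attempts=100, pause=1.0):
        print(index, elapsed)
```

If the slow attempts show up here as well, the delay is presumably at or below the TCP layer, and a handshake that takes close to a fixed number of seconds would be consistent with a dropped and retransmitted SYN, which a packet capture can confirm. If the handshake is always fast while the pool still times out, the higher layer (Thrift) is the more likely place to look.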
