Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

Steven A Robenalt Mon, 18 Nov 2013 17:50:13 -0800

Hi all,

I am attempting to bring up our new app on a 3-node cluster and am having
problems with frequent read timeouts and slow inter-node replication.
Initially, these errors were mostly occurring in our app server, affecting
0.02%-1.0% of our queries in an otherwise unloaded cluster. No exceptions
were logged on the servers in this case, and reads in a single-node
environment with the same code and client driver virtually never see
exceptions like this, so I suspect a problem with communication between
the nodes of the cluster.


The 3 nodes are deployed in a single AWS VPC, and are all in a common
subnet. The Cassandra version is 2.0.2 following an upgrade this past
weekend due to NPEs in a secondary index that were affecting certain
queries under 2.0.1. The servers are m1.large instances running Amazon Linux
and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
All database contents are CQL tables with a replication factor of 3, and the
application is Java-based, using the latest DataStax Java Driver, 2.0.0-rc1.

In testing with the application this afternoon, I noticed that the 3 nodes
held different copies of the same table for newly written data, as reported
by cqlsh on each node, and that the differences persisted for more than
several minutes. Using cqlsh from one server to connect to the other hosts
also produced timeouts, both on multiple connection attempts and on some
queries. Every attempt eventually succeeded, and eventually the data was
fully replicated to all nodes.

The AWS servers have a security group with only ports 22, 7000, 9042, and
9160 open.

At this time, it seems that either I am still missing something in my
cluster configuration, or maybe there are other ports that are needed for
inter-node communication.
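
One way to test the port theory is to check, from each node, whether the
other nodes accept TCP connections on the ports listed above. The following
is only a minimal sketch of such a check. The class name PortCheck, the
helper names, and the 2-second timeout are illustrative assumptions, and the
host addresses are passed on the command line. A successful connect shows
only that the port is reachable and something is listening, so it says
nothing about latency or replication itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Ports from the security group described above, minus SSH (22):
    // 7000 = inter-node storage port, 9042 = CQL native transport, 9160 = Thrift.
    static final int[] PORTS = {7000, 9042, 9160};

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    // Timeouts, refused connections, and unresolvable hosts all return false.
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Formats one result line for the report.
    static String describe(String host, int port, boolean open) {
        return host + ":" + port + (open ? " reachable" : " UNREACHABLE");
    }

    public static void main(String[] args) {
        // Each argument is the address of another node in the cluster.
        for (String host : args) {
            for (int port : PORTS) {
                // 2000 ms is an arbitrary (assumed) timeout for this sketch.
                System.out.println(describe(host, port, isOpen(host, port, 2000)));
            }
        }
    }
}
```

Running this on each node with the addresses of the other two nodes as
arguments should show whether any of the three ports is blocked in either
direction.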

Any advice/suggestions would be appreciated.



-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu
