Hi all,

I am attempting to bring up our new app on a 3-node cluster and am having problems with frequent read timeouts and slow inter-node replication. Initially, these errors were mostly occurring in our app server, affecting 0.02%-1.0% of our queries in an otherwise unloaded cluster. No exceptions were logged on the servers in this case, and reads in a single-node environment with the same code and client driver virtually never see exceptions like this, so I suspect problems with the inter-node communication within the cluster.
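One way to test that suspicion would be a plain TCP connect check run from each node against the others. The following is only a minimal sketch, not something taken from our deployment: the function names and the example addresses are hypothetical placeholders, and the ports assume the Cassandra defaults (7000 for inter-node storage traffic, 9042 for the CQL native transport, 9160 for Thrift). A successful connect only shows that the port is reachable; it says nothing about latency or replication health.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_nodes(hosts, ports, timeout=2.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {
        (host, port): port_is_reachable(host, port, timeout)
        for host in hosts
        for port in ports
    }


def format_report(results):
    """Turn the result map into sorted, human-readable lines."""
    return [
        f"{host}:{port} {'open' if ok else 'unreachable'}"
        for (host, port), ok in sorted(results.items())
    ]


# Example call (placeholder private IPs; substitute the real node addresses):
#   for line in format_report(check_nodes(["10.0.0.11", "10.0.0.12"], [7000, 9042, 9160])):
#       print(line)
```

Running this on every node, with the other two nodes as targets, would show whether any port is blocked in one direction only.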
The 3 nodes are deployed in a single AWS VPC, and are all in a common subnet. The Cassandra version is 2.0.2, following an upgrade this past weekend due to NPEs in a secondary index that were affecting certain queries under 2.0.1. The servers are m1.large instances running AWS Linux and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes. All database contents are CQL tables with a replication factor of 3, and the application is Java-based, using the latest DataStax 2.0.0-rc1 Java Driver.

In testing with the application, I noticed this afternoon that, according to cqlsh on each node, the 3 nodes held different copies of the same table for newly written data, and that the differences persisted for more than several minutes. Using cqlsh from a single server to connect to the different hosts also produced timeouts, both on multiple connection attempts and on some queries. The connections and queries eventually succeeded in all cases, and the data was eventually fully replicated on all nodes.

The AWS servers have a security group with only ports 22, 7000, 9042, and 9160 open.

At this time, it seems that either I am still missing something in my cluster configuration, or there are other ports that are needed for inter-node communication.

Any advice/suggestions would be appreciated.

--
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063
srobe...@stanford.edu
http://highwire.stanford.edu