I should clarify that we are running Cassandra 1.1.12.

Dave
On Fri, Oct 4, 2013 at 2:08 PM, Dave Cowen <[email protected]> wrote:

> We're testing expanding a 4-node cluster into an 8-node cluster, and we
> keep running into issues with the repair process near the end.
>
> We're bringing up nodes 1-by-1 into the cluster, retokening nodes for an
> 8-node configuration, running nodetool cleanup on the nodes after each
> retokening, and then increasing the replication factor to 5. This all works
> without issue, and the cluster appears to be healthy in that 8-node
> configuration with a replication factor of 5.
>
> However, when we then run nodetool repair on the nodes, it will at some
> point stall, even when being run on one of the new nodes.
>
> It doesn't appear to stall while it's performing a compaction or
> transferring CF data. We've monitored compactionstats and netstats closely,
> and things always stall when a repair command is started, ie:
>
> [2013-10-02 23:19:39,254] Starting repair command #9, repairing 5 ranges
> for keyspace ourkeyspace
>
> The last message from AntiEntropyService is usually something to the
> effect of:
>
> <190>Oct 3 00:01:02 myhost.com 1970947950 [AntiEntropySessions:24] INFO
> org.apache.cassandra.service.AntiEntropyService - [repair
> #9b17d310-2bbd-11e3-0000-e06ec6c436ff] session completed successfully
>
> ... and then things don't start for the next repair. Nothing in the logs
> that looks related.
>
> Where this occurs is arbitrary. If I run on individual CFs within
> ourkeyspace, some will succeed, and some will fail, but if we start over
> and do the 4-node to 8-node expansion again, things will fail at a
> different place.
>
> Advice as to what to look at next?
>
> Thanks,
>
> Dave
>
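
For reference on the retokening step quoted above, here is a minimal sketch of how evenly spaced tokens for the 4-node and 8-node layouts could be computed. It assumes the cluster uses RandomPartitioner, whose token space runs from 0 to 2**127; the message does not say which partitioner is in use, so treat that as an assumption. The function name balanced_tokens is hypothetical and is not part of any Cassandra tooling.

```python
# Sketch only: assumes RandomPartitioner (token space 0 .. 2**127).
# The partitioner is not stated in the original message.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return evenly spaced initial tokens for a ring of node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer arithmetic keeps the large token values exact.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Print the target token for each position in an 8-node ring.
    for index, token in enumerate(balanced_tokens(8)):
        print(f"node {index}: {token}")
```

Under that assumption, every second token of the 8-node layout coincides with a token of the balanced 4-node layout, so if the original four nodes already sit on balanced tokens, only the new nodes would need new positions between them.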
