Thank you for your advice. rf >= 2 is a good workaround. I was using 0.7.4 and have updated to the latest 0.7 branch, which includes the CASSANDRA-2554 patch, but it doesn't help. I still get lots of UnavailableExceptions after the following logs:
INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228) InetAddress /192.168.125.49 is now dead.
INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609) InetAddress /192.168.125.49 is now UP
INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java (line 304) Started hinted handoff for endpoint /192.168.125.49
INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java (line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49

It seems that the gossip failure detection is too sensitive. Is there any configuration to tune it?

2011/4/27 Sylvain Lebresne <sylv...@datastax.com>

> On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen <chensheng2...@gmail.com> wrote:
> > I succeeded to insert 1 billion records into a single-node cassandra,
> >>> bin/stress -d cas01 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
> > Inserts finished in about 14 hours at a speed of 20k/sec.
> > But when I added another node, tests always failed with UnavailableException in an hour.
> >>> bin/stress -d cas01,cas02 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
> > Write speed is also 20k/sec because of the bottleneck in the client, so the
> > pressure on each server node should be 50% of the single-node test.
> > Why couldn't they handle it?
> > By default, rf=1, consistency=ONE
> > Some information that may be helpful:
> > 1. no warn/error in the log file; the cluster is still alive after those exceptions
> > 2. the last logs on both nodes happen to be a compaction-complete info
> > 3. gossip log shows one node is dead and then up again in 3 seconds
>
> That's your problem. Once marked down (and since rf=1), when an update for
> cas02 reaches cas01 and cas01 has marked cas02 down, it will throw the
> UnavailableException.
>
> Now, it shouldn't have been marked down, and I suspect this is due to
> https://issues.apache.org/jira/browse/CASSANDRA-2554
> (even though you didn't say which version you're using, I suppose
> this is a 0.7.*).
>
> If you apply this patch or use the current svn 0.7 branch, that should
> hopefully not happen again.
>
> Note that if you had rf >= 2, the node would still have been wrongly marked
> down for 3 seconds, but that would have been transparent to the stress test.
>
> > 4. I set hinted_handoff_enabled: false, but still see lots of handoff logs
>
> What are those saying?
>
> --
> Sylvain
>
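On the "is there any configuration?" question: gossip liveness is decided by the phi accrual failure detector, and later Cassandra releases expose its threshold as phi_convict_threshold in cassandra.yaml. I am not certain the 0.7 branch reads this setting, so treat the fragment below as a sketch; raising the value makes the detector slower to declare a peer dead:

```yaml
# cassandra.yaml fragment (sketch; assumes a release that exposes this setting)
# Higher values make the phi accrual failure detector less aggressive about
# marking nodes dead under GC pauses or message delays. The usual default is 8.
phi_convict_threshold: 12
```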