On Wed, Apr 21, 2010 at 10:08 PM, Oleg Anastasjev <olega...@gmail.com> wrote:
> Hello,
>
> I am testing how Cassandra behaves on single-node disk failures, to know
> what to expect when things go bad.
> I had a cluster of 4 Cassandra nodes, stress-loaded it with a client, and
> ran 2 tests:
> 1. emulated disk failure of the /data volume during a read-only stress test
> 2. emulated disk failure of the /commitlog volume during a write-intensive
> test
>

Good test.

> 1. On the read test with the data volume down, a lot of
> "org.apache.thrift.TApplicationException: Internal error processing
> get_slice"
> was logged on the client side. The Cassandra server logged a lot of
> IOExceptions while reading every *.db file it has. The node continued to
> show as UP in the ring.
>
> OK, the behavior is not ideal, but it can still be worked around on the
> client side by throwing out nodes as soon as a TApplicationException is
> received from Cassandra.
>

[schubert] Usually, we should use RAID to avoid disk failure, and add some system monitors to maintain or shut down the node.

> 2. Much worse was the write test:
> No exception was seen at the client and writes went through normally, but
> PERIODIC-COMMIT-LOG-SYNCER failed to sync commit logs, the node's heap
> quickly became full, and the node froze in a GC loop. Still, it continued
> to show as UP in the ring.
>

[schubert] I think this is also a weakness in how current Cassandra implements CommitLogSync. The default config is

<CommitLogSync>periodic</CommitLogSync>

With this setting, a commit-log write is acknowledged immediately but only buffered in memory, and it is synced to disk periodically according to

<CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>

In your case, the buffer keeps growing and uses more and more heap.

In 0.6.x, you can use batch CommitLogSync to alleviate this issue:

<CommitLogSync>batch</CommitLogSync>
<CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>

I think the right design should be: use a threshold to avoid buffering too much commit log. Both the buffer time and the buffer size should trigger a sync:
When the commit-log buffer size threshold is reached, sync.
When the commit-log buffer time is reached, sync.

> This, I believe, is bad, because no quick workaround can be done on the
> client side (no exceptions are coming from the failed node), and in a real
> system it will lead to a dramatic slowdown of the whole cluster, because
> clients, not knowing that the node is actually dead, will direct 1/4 of
> their requests to it and time out.
> I think a more correct behavior here would be to halt the Cassandra server
> on any disk IO error, so clients can quickly detect this and fail over to
> healthy servers.
>
> What do you think?
>
> Have you guys experienced disk failure in production, and how was it?
>
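
[schubert] To make the size-or-time trigger I suggested above a bit more concrete, here is a minimal Python sketch. It is only an illustration and not Cassandra's commit-log code: BufferedLog, sink, max_bytes, max_age_s and clock are all made-up names, and the thresholds are arbitrary. The writer itself triggers the sync once either threshold is reached, so a failing disk surfaces as an exception on the write path instead of the buffer silently growing.

```python
import time


class BufferedLog:
    """Toy write buffer that syncs on a size OR an age threshold.

    Hypothetical sketch for discussion; not Cassandra's implementation.
    """

    def __init__(self, sink, max_bytes=1024, max_age_s=10.0, clock=time.monotonic):
        self.sink = sink              # callable that persists a bytes object
        self.max_bytes = max_bytes    # size threshold for a sync
        self.max_age_s = max_age_s    # age threshold for a sync
        self.clock = clock            # injectable clock, handy for testing
        self.pending = []             # records not yet synced
        self.pending_bytes = 0
        self.oldest = None            # arrival time of the oldest unsynced record

    def append(self, record: bytes) -> None:
        if self.oldest is None:
            self.oldest = self.clock()
        self.pending.append(record)
        self.pending_bytes += len(record)
        self.maybe_sync()

    def maybe_sync(self) -> bool:
        """Sync if either threshold is reached; return True if a sync ran."""
        if not self.pending:
            return False
        too_big = self.pending_bytes >= self.max_bytes
        too_old = self.clock() - self.oldest >= self.max_age_s
        if too_big or too_old:
            self.sync()
            return True
        return False

    def sync(self) -> None:
        # If the sink raises (e.g. the disk is gone), the exception propagates
        # to the caller and the buffer is left untouched.
        self.sink(b"".join(self.pending))
        self.pending.clear()
        self.pending_bytes = 0
        self.oldest = None
```

In this sketch the age check only runs when append() or maybe_sync() is called, so an otherwise idle writer would still need a timer to call maybe_sync() now and then.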