Re: Cassandra's bad behavior on disk failure

Schubert Zhang Wed, 28 Apr 2010 21:23:32 -0700

On Wed, Apr 21, 2010 at 10:08 PM, Oleg Anastasjev <olega...@gmail.com> wrote:


> Hello,
>
> I am testing how Cassandra behaves on single-node disk failures, to know
> what to expect when things go bad.
> I had a cluster of 4 Cassandra nodes, stress-loaded it with a client, and
> ran 2 tests:
> 1. emulated disk failure of the /data volume during a read-only stress test
> 2. emulated disk failure of the /commitlog volume during a write-intensive
> test
>
Good test.


> 1. On the read test with the data volume down, a lot of
> "org.apache.thrift.TApplicationException: Internal error processing
> get_slice" errors were logged on the client side. The Cassandra server
> logged a lot of IOExceptions while reading every *.db file it has. The node
> continued to show as UP in the ring.
>
> OK, this behavior is not ideal, but it can still be worked around on the
> client side by throwing out a node as soon as a TApplicationException is
> received from it.
>
[schubert] Usually, we should use RAID to guard against disk failure,
and add some system monitors to maintain or shut down the node.
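
The client-side workaround described in the quoted text (throwing out a node
as soon as it returns an application error) could be sketched roughly as
below. This is only an illustration: NodePool, retry_after_s and the request
callable are hypothetical names, not part of any Cassandra or Thrift client,
and the caller is assumed to pass in whatever exception type its client
library raises (for example a TApplicationException).

```python
import time


class NodePool:
    """Hypothetical round-robin pool that sidelines a node after an error."""

    def __init__(self, nodes, retry_after_s=30.0, clock=time.monotonic):
        self._nodes = list(nodes)
        self._retry_after_s = retry_after_s
        self._clock = clock
        self._down_until = {}  # node -> time at which it may be tried again
        self._next = 0

    def healthy_nodes(self):
        now = self._clock()
        return [n for n in self._nodes
                if n not in self._down_until or self._down_until[n] <= now]

    def mark_down(self, node):
        # Sideline the node for a while instead of removing it forever.
        self._down_until[node] = self._clock() + self._retry_after_s

    def call(self, request, node_errors=(Exception,)):
        """Run request(node) on healthy nodes until one of them succeeds."""
        last_error = None
        for _ in range(len(self._nodes)):
            healthy = self.healthy_nodes()
            if not healthy:
                break
            node = healthy[self._next % len(healthy)]
            self._next += 1
            try:
                return request(node)
            except node_errors as exc:
                self.mark_down(node)  # throw the node out on its first error
                last_error = exc
        raise RuntimeError("no healthy node could serve the request") from last_error
```

Note that this only helps in the read case: it relies on the failed node
actually returning an error to the client.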


> 2. The write test was much worse:
> No exception was seen at the client and writes kept going through normally,
> but PERIODIC-COMMIT-LOG-SYNCER failed to sync the commit logs, the node's
> heap quickly filled up, and the node froze in a GC loop. Still, it continued
> to show as UP in the ring.
>
[schubert] I think this is also a weakness in the current Cassandra
implementation of CommitLogSync.
The default config is <CommitLogSync>periodic</CommitLogSync>
A commit-log write is acknowledged immediately, but it is only buffered in
memory and is synced to disk periodically, according to
<CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>.

In your case, the buffer will grow without limit and use more and more
heap.

In 0.6.x, you can use batch CommitLogSync to alleviate this issue, since in
batch mode a write is not acknowledged until it has been synced to disk:
<CommitLogSync>batch</CommitLogSync>
<CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>

I think the right design would be:
Use a threshold to avoid buffering too much of the commit log. Both the
buffer time and the buffer size should trigger a sync:
When the commit-log buffer size threshold is reached, sync.
When the commit-log buffer time is reached, sync.
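
A minimal sketch of the two triggers above, to make the idea concrete. This is
not Cassandra's code: BoundedCommitLogBuffer, sink, max_bytes and max_age_s
are hypothetical names, and sink stands in for whatever durably writes and
syncs the bytes (it is assumed to raise OSError when the disk fails).

```python
import time


class BoundedCommitLogBuffer:
    """Hypothetical buffer that syncs on a size threshold or an age threshold."""

    def __init__(self, sink, max_bytes, max_age_s, clock=time.monotonic):
        self._sink = sink          # callable(bytes); assumed to raise OSError on disk failure
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock
        self._pending = []
        self._pending_bytes = 0
        self._oldest = None        # time the oldest unsynced entry was buffered

    @property
    def pending_bytes(self):
        return self._pending_bytes

    def append(self, entry):
        # Size trigger: sync before the buffer would exceed its threshold.
        # If the sync raises, the new entry is rejected, so the buffer stays bounded.
        if self._pending and self._pending_bytes + len(entry) > self._max_bytes:
            self.sync()
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(entry)
        self._pending_bytes += len(entry)

    def tick(self):
        # Time trigger: meant to be called periodically, e.g. from a timer thread.
        if self._oldest is not None and self._clock() - self._oldest >= self._max_age_s:
            self.sync()

    def sync(self):
        if not self._pending:
            return
        self._sink(b"".join(self._pending))  # an OSError propagates to the caller
        self._pending.clear()
        self._pending_bytes = 0
        self._oldest = None
```

With a failing disk, the sync in append() raises once the size threshold is
hit, so the writer sees an error instead of the heap filling up silently.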



> This, I believe, is bad, because no quick workaround is possible on the
> client side (no exceptions come from the failed node), and in a real system
> it will lead to a dramatic slowdown of the whole cluster: clients, not
> knowing that the node is actually dead, will direct 1/4 of their requests to
> it and time out.
>
> I think the more correct behavior here would be to halt the Cassandra
> server on any disk IO error, so that clients can quickly detect this and
> fail over to healthy servers.
>
> What do you think?
>
> Have you guys experienced disk failures in production, and how did it go?
>
>
>
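
The "halt on any disk IO error" behavior proposed in the quoted text could
look roughly like the sketch below. It is only an illustration of the idea in
a generic process, not how Cassandra handles disk errors; fail_fast and
halt_process are hypothetical names.

```python
import os
import sys


def halt_process(exc):
    """Default halt action: report the error and exit at once, skipping cleanup."""
    print(f"fatal disk error, halting: {exc}", file=sys.stderr)
    os._exit(1)


def fail_fast(io_operation, halt=halt_process):
    """Wrap io_operation so that any OSError it raises triggers halt(exc)."""
    def wrapped(*args, **kwargs):
        try:
            return io_operation(*args, **kwargs)
        except OSError as exc:
            halt(exc)
            raise  # only reached if a custom halt() returns
    return wrapped
```

Once the process is gone, client connections fail immediately, which is the
quick failure signal the quoted text asks for.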
