Disaster recovery question

graham sanderson Sat, 16 Nov 2013 05:49:26 -0800

We are currently looking to deploy on the 2.0 line of Cassandra, but obviously 
are watching for bugs (we are currently on 2.0.2). We are aware of a couple of 
interesting known bugs to be fixed in 2.0.3 and one in 2.1, but none have been 
observed (in production use cases) or are likely to affect our current proposed 
deployment.


I have a few general questions:

The first particular test we tried was to physically remove the SSD commit log 
drive for one of the nodes whilst under HEAVY write load (maybe a few hundred 
MB/s of data to be replicated 3 times; 6-node single local data center) and 
also while running read performance tests. We currently have both node (CQL3) 
and Astyanax (Thrift) clients.

Frankly everything was pretty good (no read/write failures or indeed any 
observed latency issues), with the following exceptions; maybe people can 
comment on any of these:

1) There were NO errors in the log on the node where we removed the commit log 
SSD drive - this surprised us (of course our ops monitoring would detect the 
downed disk too, but we hope to be able to look for ERROR level logging in 
system.log to cause alerts also)
2) The node with no commit log disk just kept writing to memtables, but:
3) This was causing major CMS GC issues, which eventually caused the node to 
appear down (nodetool status) to all other nodes, and indeed it itself saw all 
other nodes as down. That said, the dynamic snitch and latency detection in the 
clients seemed to prevent this from being much of a problem, other than that it 
seems potentially undesirable from a server-side standpoint.
4) nodetool gossipinfo didn’t report anything abnormal for any nodes when run 
from any node.
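
As a rough sketch of the log-based alerting mentioned in 1), the following 
scans log lines for ERROR-level entries. It assumes the log level is the first 
token on each line (possibly preceded by padding spaces); that layout, and the 
function names, are assumptions for illustration rather than anything taken 
from our setup. Note that, as observed in 1), such a check only helps if the 
server actually logs the failure, so it would complement disk monitoring rather 
than replace it.

```python
import re

# Assumed layout: the level is the first token on the line, e.g.
# "ERROR [SomeThread:1] 2013-11-16 05:49:26,000 Foo.java (line 1) message"
LEVEL_RE = re.compile(r"^\s*(ERROR|FATAL)\b")


def find_error_lines(lines):
    """Return the lines whose leading log level is ERROR or FATAL."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.match(line)]


def scan_log(path):
    """Scan a log file on disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_error_lines(handle)
```

A monitoring job could call scan_log() on system.log periodically and raise an 
alert whenever the returned list is non-empty.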

Sadly, because of an Astyanax issue (we were using the Thrift code path that 
does a (now unnecessary) describe cluster to check for schema disagreement 
before schema changes), we weren’t able to create a new CF with a node marked 
down, and thus couldn’t immediately add more data to see what would have 
happened: OOM or failure. (We have since fixed this to go through the CQL3 code 
path, but have not yet re-run the tests because of other application-level 
testing going on.) That said, maybe someone knows off the top of their head if 
there is a config setting that would start failing writes (due to memtable 
size) before GC became an issue, and we just have this misconfigured.

Secondly, our test was perhaps unrealistic in that when we brought the node 
back up, we did so with the partial commit log on the replaced disk intact (but 
the in-memory data lost). We did get the following sorts of errors:

At level 1, 
SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db')
 [DecoratedKey(3508309769529441563, 2d37363730383735353837333637383432323934), 
DecoratedKey(9158434231083901894, 343934353436393734343637393130393335)] 
overlaps 
SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db')
 [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436), 
DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)].  
This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact 
that you have dropped sstables from another node into the data directory. 
Sending back to L0.  If you didn't drop in sstables, and have not yet run 
scrub, you should do so since you may also have rows out-of-order within an 
sstable

5) I guess the question is what is the best way to bring up a failed node:
        a) delete all data first?
        b) clear data but restore from a previous sstable backup to minimise 
subsequent data transfer?
        c) other suggestions?

6) Our experience is that taking nodes down that have problems, then deleting 
data (subsets if we can see partial corruption) and re-adding them is much safer 
(but our cluster is VERY fast). That said, can we re-sync data before 
re-enabling gossip, or at least before serving read requests from those nodes? 
This is not a huge issue, but it would mitigate consistency issues with 
partially recovered data in the case that multiple quorum read members were 
recovering. Note that we fall back from (LOCAL_)QUORUM to LOCAL_ONE on 
UnavailableException, so we have less of a guarantee than when both writing and 
reading at LOCAL_QUORUM. (If our LOCAL_QUORUM writes fail we will just retry 
when the cluster is fixed; stale data is not ideal but OK for a while.)
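
To make the consistency trade-off in 6) concrete, here is a minimal sketch of 
the read fallback and of the usual quorum overlap arithmetic. It is not our 
client code: the Unavailable class is a hypothetical stand-in for the driver's 
unavailable error, and execute(query, consistency) is an assumed 
caller-supplied callable.

```python
class Unavailable(Exception):
    """Hypothetical stand-in for the driver's 'not enough replicas' error."""


def quorum(replication_factor):
    """Number of replicas that make up a quorum."""
    return replication_factor // 2 + 1


def overlap_guaranteed(read_replicas, write_replicas, replication_factor):
    """True when every read set must intersect every successful write set."""
    return read_replicas + write_replicas > replication_factor


def read_with_fallback(execute, query):
    """Try LOCAL_QUORUM, then retry once at LOCAL_ONE if replicas are unavailable.

    `execute(query, consistency)` is an assumed caller-supplied callable.
    Returns a (result, consistency_used) tuple.
    """
    try:
        return execute(query, "LOCAL_QUORUM"), "LOCAL_QUORUM"
    except Unavailable:
        return execute(query, "LOCAL_ONE"), "LOCAL_ONE"
```

With a replication factor of 3, quorum(3) is 2, so quorum reads plus quorum 
writes overlap (2 + 2 > 3), whereas a fallback read at ONE against a quorum 
write does not (1 + 2 = 3). That is why a read served by a single 
partially recovered replica may return stale data.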

That said, given that the commit log on disk pre-dated any uncommitted lost 
memtable data, it seems that we shouldn’t have seen exceptions, because this is 
kind of like 5)b) in that it should have gotten us closer to the correct state 
before the rest of the data was repaired, rather than causing any weirdness 
(unless it was a missed fsync problem), but maybe I’m being naive.

Sorry for the long post, any thoughts would be appreciated.

Thanks,

Graham.
