Agreed; that was a parallel issue filed from our ops side (I apologize and will try to avoid duplicates). I was asking the question from the architecture side, i.e. what should happen, rather than describing it as a bug. Nonetheless, we are still curious whether anyone has an answer.
On Nov 16, 2013, at 6:13 PM, Mikhail Stepura <mikhail.step...@outlook.com> wrote:

> Looks like someone has the same (1-4) questions:
> https://issues.apache.org/jira/browse/CASSANDRA-6364
>
> -M
>
> "graham sanderson" wrote in message news:7161e7e0-cf24-4b30-b9ca-2faafb0c4...@vast.com...
>
> We are currently looking to deploy on the 2.0 line of Cassandra, but obviously are watching for bugs (we are currently on 2.0.2). We are aware of a couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, but none of them have been observed (in our production use cases) or are likely to affect our currently proposed deployment.
>
> I have a few general questions.
>
> The first test we tried was to physically remove the SSD commit log drive from one of the nodes whilst under HEAVY write load (maybe a few hundred MB/s of data, to be replicated 3 times across a 6-node, single local data center cluster) and also while running read performance tests. We currently have both node (CQL3) and Astyanax (Thrift) clients.
>
> Frankly, everything was pretty good (no read/write failures, nor indeed any observed latency issues), except for the following, and maybe people can comment on any of these:
>
> 1) There were NO errors in the log on the node where we removed the commit log SSD drive. This surprised us (of course our ops monitoring would detect the downed disk too, but we hope to be able to look for ERROR level logging in system.log to drive alerts as well).
> 2) The node with no commit log disk just kept writing to memtables, but:
> 3) This was causing major CMS GC issues, which eventually caused the node to appear down (nodetool status) to all other nodes, and indeed it itself saw all other nodes as down. That said, the dynamic snitch and latency detection in the clients seemed to prevent this from being much of a problem, other than it being potentially undesirable from a server-side standpoint.
> 4) nodetool gossipinfo didn't report anything abnormal for any node when run from any node.
>
> Sadly, because of an Astyanax issue (we were using the Thrift code path that does a (now unnecessary) describe cluster to check for schema disagreement before schema changes), we weren't able to create a new CF with a node marked down, and thus couldn't immediately add more data to see what would have happened: OOM or failure (we have since fixed this to go through the CQL3 code path, but have not yet re-run the tests because of other application-level testing going on)… That said, maybe someone knows off the top of their head whether there is a config setting that would start failing writes (due to memtable size) before GC becomes an issue, and we simply have this misconfigured.
>
> Secondly, our test was perhaps unrealistic in that when we brought the node back up, we did so with the partial commit log on the replaced disk intact (but the in-memory data lost), and we did get the following sorts of errors:
>
> At level 1, SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db') [DecoratedKey(3508309769529441563, 2d37363730383735353837333637383432323934), DecoratedKey(9158434231083901894, 343934353436393734343637393130393335)] overlaps SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db') [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436), DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)]. This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory. Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable
>
> 5) I guess the question is: what is the best way to bring up a failed node?
> a) delete all data first?
> b) clear data but restore from a previous sstable backup to minimise subsequent data transfer?
> c) other suggestions?
>
> 6) Our experience is that taking problem nodes down, deleting data (subsets, if we can see partial corruption) and re-adding them is much safer (but our cluster is VERY fast). That said, can we re-sync data before re-enabling gossip, or at least before serving read requests from those nodes? (Not a huge issue, but it would mitigate consistency issues with partially recovered data in the case that multiple quorum read members were recovering.) Note that we fall back from (LOCAL_)QUORUM to LOCAL_ONE on UnavailableException, so we have a weaker guarantee than both writing and reading at LOCAL_QUORUM (and note that if our LOCAL_QUORUM writes fail we will just retry when the cluster is fixed - stale data is not ideal but is OK for a while).
>
> That said, given that the commit log on disk pre-dated any uncommitted lost memtable data, it seems that we shouldn't have seen exceptions, because this is kind of like 5)b) in that it should have gotten us closer to the correct state before the rest of the data was repaired, rather than causing any weirdness (unless it was a missed fsync problem) - but maybe I'm being naive.
>
> Sorry for the long post; any thoughts would be appreciated.
>
> Thanks,
>
> Graham.
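P.S. Since it came up above: "going through the CQL3 code path" for schema changes just means issuing the statement over the native protocol instead of Thrift. A minimal sketch of that shape against the DataStax Java driver 2.0 (not our actual client code - the contact point, keyspace, and table definition below are made up purely for illustration):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SchemaChangeOverCql3 {
        public static void main(String[] args) {
            // Contact point is a placeholder; any reachable node will do.
            Cluster cluster = Cluster.builder()
                                     .addContactPoint("10.0.0.1")
                                     .build();
            try {
                Session session = cluster.connect();
                // Hypothetical keyspace/table standing in for our dated CFs.
                // The schema change goes straight over the native protocol; no
                // Thrift "describe cluster" schema-agreement pre-check is involved.
                session.execute("CREATE TABLE IF NOT EXISTS searchapi.feed_20131113 (" +
                                "  key text PRIMARY KEY," +
                                "  value blob" +
                                ")");
            } finally {
                cluster.close();
            }
        }
    }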
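P.P.S. On the (LOCAL_)QUORUM -> LOCAL_ONE fallback mentioned in 6): a minimal sketch of the shape of that logic, again against the DataStax Java driver 2.0 rather than our actual clients (the class and method names are made up; in reality failed LOCAL_QUORUM writes are also queued and retried once the cluster is healthy again):

    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.exceptions.UnavailableException;

    public final class QuorumFallback {
        // Try LOCAL_QUORUM first; only on UnavailableException (not enough live
        // replicas) re-issue the same statement at LOCAL_ONE.
        public static ResultSet executeWithFallback(Session session, Statement stmt) {
            stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            try {
                return session.execute(stmt);
            } catch (UnavailableException e) {
                // Weaker/staler results are acceptable for a while; for writes the
                // statement is also remembered and retried at LOCAL_QUORUM later.
                stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
                return session.execute(stmt);
            }
        }
    }

(The driver's built-in DowngradingConsistencyRetryPolicy does something broadly similar automatically, for those who prefer not to hand-roll it.)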