>> The first particular test we tried 
What was the disk_failure_policy setting?

>> 1) There were NO errors in the log on the node where we removed the commit 
>> log SSD drive - this surprised us (of course our ops monitoring would detect 
>> the downed disk too, but we hope to be able to look for ERROR level logging 
>> in system.log to cause alerts also)
Can you reproduce this without needing to physically pull the drive?
Obviously there should be an error or warning there. Even if
disk_failure_policy is set to ignore it should still log.
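
For reference, the relevant knob lives in cassandra.yaml. The excerpt below is
illustrative (typed from memory for the 1.2/2.0 line), so check the comments in
the yaml that shipped with your version:

    # illustrative cassandra.yaml excerpt, not your actual config
    #   stop        - shut down gossip and Thrift, leaving the node
    #                 effectively down but still inspectable over JMX
    #   best_effort - stop using the failed disk and answer from
    #                 whatever sstables remain
    #   ignore      - respond as if nothing happened (pre-1.2 behaviour)
    disk_failure_policy: stop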

>> 2) The node with no commit log disk just kept writing to memtables, but:
>> 3) This was causing major CMS GC issues which eventually caused the node to 
>> appear down (nodetool status) to all other nodes, and indeed it itself saw 
>> all other nodes as down. That said dynamic snitch and latency detection in 
>> clients seemed to prevent this being much of a problem other than it seems 
>> potentially undesirable from a server side standpoint.
The commit log has a queue that is 1024 * number of processors long. If the 
write thread can get onto this queue it will proceed (when using the periodic 
commit log), so if there was no error I would expect writes to keep working for 
a little while. Eventually, though, that queue will fill up and the write 
threads will no longer be able to proceed. The queue for the Mutation stage is 
essentially unbounded, so while the other nodes keep sending writes it will 
continue to grow, leading to the CMS GC issues. 
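
To make the dynamic concrete, here is a rough Java sketch of the queueing
behaviour described above. It does not use Cassandra's actual classes; the
class name, queue sizes and payload sizes are invented for illustration:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative only: a bounded "commit log" queue that is never drained
    // (the disk is gone), fed alongside an effectively unbounded "mutation
    // stage" queue that keeps accepting work and holding on to heap.
    public class QueueSketch {
        public static void main(String[] args) throws InterruptedException {
            int procs = Runtime.getRuntime().availableProcessors();

            // Bounded: once full, offers start failing (or writers block).
            BlockingQueue<byte[]> commitLogQueue =
                    new ArrayBlockingQueue<byte[]>(1024 * procs);

            // Effectively unbounded: replicated writes from other nodes pile
            // up here, which is where the heap pressure (and the CMS trouble)
            // comes from.
            BlockingQueue<byte[]> mutationStage = new LinkedBlockingQueue<byte[]>();

            for (int i = 0; i < 200000; i++) {
                byte[] mutation = new byte[256];
                mutationStage.put(mutation);                       // always succeeds
                boolean accepted = commitLogQueue.offer(mutation); // fails once full
                if (!accepted && i % 50000 == 0) {
                    System.out.println("commit log queue full; mutation stage depth = "
                            + mutationStage.size());
                }
            }
        }
    }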

Seeing nodes as down is a side effect of JVM GC preventing the Gossip threads 
from running frequently enough. 
 
>>  that said maybe someone knows off the top of their head if there is a 
>> config setting that would start failing writes (due to memtable size) before 
>> GC became an issue, and we just have this misconfigured.
Nope. 
Cassandra does not have an explicit back pressure mechanism. The best we have 
is the dynamic snitch and gossip eventually marking the node as down. 
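
On the client side the usual mitigation is the kind of fallback you already
described: retry at a lower consistency level when the coordinator reports too
few replicas. A minimal sketch of that pattern, against a purely hypothetical
client interface (Session, UnavailableException and the consistency strings
here are placeholders, not a specific driver API):

    // Hypothetical client interface, for illustration only.
    interface Session {
        void execute(String cql, String consistencyLevel) throws UnavailableException;
    }

    class UnavailableException extends Exception {}

    public class WriteWithFallback {
        // Try LOCAL_QUORUM first; if the coordinator cannot see enough
        // replicas, retry at LOCAL_ONE and accept the weaker guarantee.
        static void write(Session session, String cql) throws UnavailableException {
            try {
                session.execute(cql, "LOCAL_QUORUM");
            } catch (UnavailableException e) {
                session.execute(cql, "LOCAL_ONE");
            }
        }
    }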

>> 5) I guess the question is what is the best way to bring up a failed node 
>>      a) delete all data first? 
>>      b) clear data but restore from previous sstable from backup to minimise 
>> subsequent data transfer
>>      c) other suggestions
It depends on the failure. In your example I would have brought it back either 
with or without the commit log, or with the commit log minus the most recently 
modified file. There is protection in the commit log replay to only replay 
mutations that match the CRC check. When it was back online I would run a 
repair (without -pr) to repair all the data on the node. 
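
To illustrate the kind of protection meant (a sketch of the idea only, not
Cassandra's actual commit log format or replay code): each record carries a
checksum, and a record whose payload no longer matches it is skipped rather
than replayed.

    import java.util.zip.CRC32;

    // Sketch only: skip replaying any record whose stored checksum does not
    // match the CRC computed over its payload (e.g. a torn or partial write).
    public class ReplaySketch {
        static boolean shouldReplay(byte[] payload, long storedChecksum) {
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            return crc.getValue() == storedChecksum;
        }

        public static void main(String[] args) {
            byte[] record = "example mutation bytes".getBytes();
            CRC32 crc = new CRC32();
            crc.update(record, 0, record.length);
            long goodChecksum = crc.getValue();

            System.out.println(shouldReplay(record, goodChecksum));     // true
            System.out.println(shouldReplay(record, goodChecksum + 1)); // false -> skipped
        }
    }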

I’m not sure the leveled compaction error has anything to do with the commit log replay. 


>> 6) Our experience is that taking nodes down that have problems, then 
>> deleting data (subsets if we can see partial corruption) and re-adding is 
>> much safer (but our cluster is VERY fast). 
You should not need to do this. What sort of corruption are you seeing? 


Hope that helps. 

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 17/11/2013, at 3:56 pm, graham sanderson <gra...@vast.com> wrote:

> agreed; that was a parallel issue from our ops (I apologize and will try to 
> avoid duplicates) - I was asking the question from the architecture side as 
> to what should happen rather than describing it as a bug. Nonetheless, I/We 
> are still curious if anyone has an answer.
> 
> On Nov 16, 2013, at 6:13 PM, Mikhail Stepura <mikhail.step...@outlook.com> 
> wrote:
> 
>> Looks like someone has the same (1-4) questions:
>> https://issues.apache.org/jira/browse/CASSANDRA-6364
>> 
>> -M
>> 
>> "graham sanderson"  wrote in message 
>> news:7161e7e0-cf24-4b30-b9ca-2faafb0c4...@vast.com...
>> 
>> We are currently looking to deploy on the 2.0 line of cassandra, but 
>> obviously are watching for bugs (we are currently on 2.0.2) - we are aware 
>> of a couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, 
>> but none have been observed (in production use cases) or are likely to 
>> affect our current proposed deployment.
>> 
>> I have a few general questions:
>> 
>> The first particular test we tried was to physically remove the SSD commit 
>> drive for one of the nodes whilst under HEAVY write load (maybe a few 
>> hundred MB/s of data to be replicated 3 times - 6 node single local data 
>> center) and also while running read performance tests.. We currently have 
>> both node (CQL3) and Astyanax (Thrift) clients.
>> 
>> Frankly everything was pretty good (no read/write failures or indeed 
>> (observed) latency issues) except, and maybe people can comment on any of 
>> these:
>> 
>> 1) There were NO errors in the log on the node where we removed the commit 
>> log SSD drive - this surprised us (of course our ops monitoring would detect 
>> the downed disk too, but we hope to be able to look for ERROR level logging 
>> in system.log to cause alerts also)
>> 2) The node with no commit log disk just kept writing to memtables, but:
>> 3) This was causing major CMS GC issues which eventually caused the node to 
>> appear down (nodetool status) to all other nodes, and indeed it itself saw 
>> all other nodes as down. That said dynamic snitch and latency detection in 
>> clients seemed to prevent this being much of a problem other than it seems 
>> potentially undesirable from a server side standpoint.
>> 4) nodetool gossipinfo didn’t report anything abnormal for any nodes when 
>> run from any node.
>> 
>> Sadly because of an Astyanax issue (we were using the thrift code path that 
>> does a (now unnecessary) describe cluster to check for schema disagreement 
>> before schema changes) we weren’t able to create a new CF with a node marked 
>> down, and thus couldn’t immediately add more data to see what would have 
>> happened: EOM or failure (we have since fixed this to go thru CQL3 code path 
>> but not yet re-run the tests because of other application level testing 
>> going on)… that said maybe someone knows off the top of their head if there 
>> is a config setting that would start failing writes (due to memtable size) 
>> before GC became an issue, and we just have this misconfigured.
>> 
>> Secondly, our test was perhaps unrealistic in that when we brought the node 
>> back up, we did so with the partial commit log on the replaced disk intact 
>> (but the memory data lost), but we did get the following sorts of errors:
>> 
>> At level 1, 
>> SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db')
>>  [DecoratedKey(3508309769529441563, 
>> 2d37363730383735353837333637383432323934), DecoratedKey(9158434231083901894, 
>> 343934353436393734343637393130393335)] overlaps 
>> SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db')
>>  [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436), 
>> DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)]. 
>> This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact 
>> that you have dropped sstables from another node into the data directory. 
>> Sending back to L0.  If you didn't drop in sstables, and have not yet run 
>> scrub, you should do so since you may also have rows out-of-order within an 
>> sstable
>> 
>> 5) I guess the question is what is the best way to bring up a failed node
>> a) delete all data first?
>> b) clear data but restore from previous sstable from backup to minimise 
>> subsequent data transfer
>> c) other suggestions
>> 
>> 6) Our experience is that taking nodes down that have problems, then 
>> deleting data (subsets if we can see partial corruption) and re-adding is 
>> much safer (but our cluster is VERY fast). That said can we re-sync data 
>> before re-enabling gossip, or at least before serving read requests from 
>> those nodes (not a huge issue but it would mitigate consistency issues with 
>> partially recovered data in the case that multiple quorum read members were 
>> recovering) - note we fallback from (LOCAL_)QUORUM to (LOCAL_ONE) on 
>> UnavailableException, so have less guarantee compared with both writing and 
>> reading at LOCAL_QUORUM (note that if our LOCAL_QUORUM writes fail we will 
>> just retry when the cluster is fixed - stale data is not ideal but OK for a 
>> while)
>> 
>> That said given that the commit log on disk pre-dated any uncommitted lost 
>> memtable data, it seems that we shouldn’t have seen exceptions because this 
>> is kind of like 5)b) in that it should have gotten us closer to the correct 
>> state before the rest of the data was repaired rather than causing any 
>> weirdness (unless it was a missed fsync problem), but maybe I’m being naive.
>> 
>> Sorry for the long post, any thoughts would be appreciated.
>> 
>> Thanks,
>> 
>> Graham. 
>> 
> 
