Agreed; that was a parallel issue filed from our ops side (I apologize and will try to avoid duplicates). I was asking the question from the architecture side, i.e. what should happen, rather than describing it as a bug. Nonetheless, we are still curious whether anyone has an answer.
On Nov 16, 2013, at 6:13 PM, Mikhail Stepura <mikhail.step...@outlook.com> wrote:

> Looks like someone has the same (1-4) questions:
> https://issues.apache.org/jira/browse/CASSANDRA-6364
>
> -M
>
> "graham sanderson" wrote in message news:7161e7e0-cf24-4b30-b9ca-2faafb0c4...@vast.com...
>
> We are currently looking to deploy on the 2.0 line of Cassandra, but obviously are watching for bugs (we are currently on 2.0.2). We are aware of a couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, but none of them have been observed (in our production use cases) or are likely to affect our currently proposed deployment.
>
> I have a few general questions.
>
> The first test we tried was to physically remove the SSD commit log drive from one of the nodes whilst under HEAVY write load (maybe a few hundred MB/s of data, to be replicated 3 times across a 6-node, single local data center cluster) and also while running read performance tests. We currently have both node (CQL3) and Astyanax (Thrift) clients.
>
> Frankly, everything was pretty good (no read/write failures, nor indeed any observed latency issues), except for the following, and maybe people can comment on any of these:
>
> 1) There were NO errors in the log on the node where we removed the commit log SSD drive. This surprised us (of course our ops monitoring would detect the downed disk too, but we hope to be able to look for ERROR level logging in system.log to drive alerts as well).
> 2) The node with no commit log disk just kept writing to memtables, but:
> 3) This was causing major CMS GC issues, which eventually caused the node to appear down (nodetool status) to all other nodes, and indeed it itself saw all other nodes as down. That said, the dynamic snitch and latency detection in the clients seemed to prevent this from being much of a problem, other than it being potentially undesirable from a server-side standpoint.
> 4) nodetool gossipinfo didn't report anything abnormal for any node when run from any node.
>
> Sadly, because of an Astyanax issue (we were using the Thrift code path that does a (now unnecessary) describe cluster to check for schema disagreement before schema changes), we weren't able to create a new CF with a node marked down, and thus couldn't immediately add more data to see what would have happened: OOM or failure (we have since fixed this to go through the CQL3 code path, but have not yet re-run the tests because of other application-level testing going on)… That said, maybe someone knows off the top of their head whether there is a config setting that would start failing writes (due to memtable size) before GC becomes an issue, and we simply have this misconfigured.
>
> Secondly, our test was perhaps unrealistic in that when we brought the node back up, we did so with the partial commit log on the replaced disk intact (but the in-memory data lost), and we did get the following sorts of errors:
>
> At level 1, SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db') [DecoratedKey(3508309769529441563, 2d37363730383735353837333637383432323934), DecoratedKey(9158434231083901894, 343934353436393734343637393130393335)] overlaps SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db') [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436), DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)]. This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory. Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable
>
> 5) I guess the question is: what is the best way to bring up a failed node?
> a) delete all data first?
> b) clear data but restore from a previous sstable backup to minimise subsequent data transfer?
> c) other suggestions?
>
> 6) Our experience is that taking problem nodes down, deleting data (subsets, if we can see partial corruption) and re-adding them is much safer (but our cluster is VERY fast). That said, can we re-sync data before re-enabling gossip, or at least before serving read requests from those nodes? (Not a huge issue, but it would mitigate consistency issues with partially recovered data in the case that multiple quorum read members were recovering.) Note that we fall back from (LOCAL_)QUORUM to LOCAL_ONE on UnavailableException, so we have a weaker guarantee than both writing and reading at LOCAL_QUORUM (and note that if our LOCAL_QUORUM writes fail we will just retry when the cluster is fixed - stale data is not ideal but is OK for a while).
>
> That said, given that the commit log on disk pre-dated any uncommitted lost memtable data, it seems that we shouldn't have seen exceptions, because this is kind of like 5)b) in that it should have gotten us closer to the correct state before the rest of the data was repaired, rather than causing any weirdness (unless it was a missed fsync problem) - but maybe I'm being naive.
>
> Sorry for the long post; any thoughts would be appreciated.
>
> Thanks,
>
> Graham.
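P.S. Since it came up above: "going through the CQL3 code path" for schema changes just means issuing the statement over the native protocol instead of Thrift. A minimal sketch of that shape against the DataStax Java driver 2.0 (not our actual client code - the contact point, keyspace, and table definition below are made up purely for illustration):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SchemaChangeOverCql3 {
        public static void main(String[] args) {
            // Contact point is a placeholder; any reachable node will do.
            Cluster cluster = Cluster.builder()
                                     .addContactPoint("10.0.0.1")
                                     .build();
            try {
                Session session = cluster.connect();
                // Hypothetical keyspace/table standing in for our dated CFs.
                // The schema change goes straight over the native protocol; no
                // Thrift "describe cluster" schema-agreement pre-check is involved.
                session.execute("CREATE TABLE IF NOT EXISTS searchapi.feed_20131113 (" +
                                "  key text PRIMARY KEY," +
                                "  value blob" +
                                ")");
            } finally {
                cluster.close();
            }
        }
    }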
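P.P.S. On the (LOCAL_)QUORUM -> LOCAL_ONE fallback mentioned in 6): a minimal sketch of the shape of that logic, again against the DataStax Java driver 2.0 rather than our actual clients (the class and method names are made up; in reality failed LOCAL_QUORUM writes are also queued and retried once the cluster is healthy again):

    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.exceptions.UnavailableException;

    public final class QuorumFallback {
        // Try LOCAL_QUORUM first; only on UnavailableException (not enough live
        // replicas) re-issue the same statement at LOCAL_ONE.
        public static ResultSet executeWithFallback(Session session, Statement stmt) {
            stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            try {
                return session.execute(stmt);
            } catch (UnavailableException e) {
                // Weaker/staler results are acceptable for a while; for writes the
                // statement is also remembered and retried at LOCAL_QUORUM later.
                stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
                return session.execute(stmt);
            }
        }
    }

(The driver's built-in DowngradingConsistencyRetryPolicy does something broadly similar automatically, for those who prefer not to hand-roll it.)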