Only if you actually change cassandra.yaml (that was the change in 2.1.6, which is why it matters what version he upgraded from).
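For readers following along, the cassandra.yaml fragment under discussion might look like this; a sketch only, with the window value taken from Graham's suggestion later in the thread rather than any measured tuning:

```yaml
# cassandra.yaml (sketch): batch commit log sync, relevant to CASSANDRA-9504.
# In batch mode, writes block until the commit log is fsynced; the window
# bounds how long writes wait so they can be grouped into a single sync.
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2   # Graham suggests 1 (or 2)
```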
> On Oct 29, 2015, at 10:06 PM, Sebastian Estevez <sebastian.este...@datastax.com> wrote:
>
> The thing about the CASSANDRA-9504 theory is that it was solved in 2.1.6 and Jeff's running 2.1.11.
>
> @Jeff
>
> How often does this happen? Can you watch ttop as soon as you notice increased read/write latencies?
>
> wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
> java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU
>
> This should at least tell you which Cassandra threads are causing high memory allocations and CPU consumption.
>
> All the best,
>
> Sebastián Estévez
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
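Alongside ttop, the pending queues Jeff describes can be watched mechanically from `nodetool tpstats`. A minimal sketch, assuming the usual column order (Pool Name, Active, Pending, ...); a canned sample stands in for live output so the snippet is self-contained, and the 1000 threshold is arbitrary:

```shell
# Flag thread pools whose Pending count exceeds a threshold (here 1000).
# Against a live node you would pipe in:  nodetool tpstats | awk '...'
awk 'NR > 1 && $3 + 0 > 1000 { print $1, "pending:", $3 }' <<'EOF'
Pool Name          Active Pending Completed
MutationStage      2      54321   999999
ReadStage          1      0       888888
GossipStage        0      10      777777
EOF
```

Wrapped in `watch -n 5`, this gives a cheap early warning before the 5-digit MutationStage backlog Jeff reports.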
> On Thu, Oct 29, 2015 at 9:36 PM, Graham Sanderson <gra...@vast.com> wrote:
>
> You didn’t say what you upgraded from, but if it is 2.0.x, then look at CASSANDRA-9504.
>
> If so, and you use
>
>     commitlog_sync: batch
>
> then you probably want to set
>
>     commitlog_sync_batch_window_in_ms: 1 (or 2)
>
> Note I’m only slightly convinced this is the cause, because of your READ_REPAIR issues (though if you are dropping a lot of MUTATIONs under load and your machines are overloaded, you’d probably be doing more READ_REPAIR than usual).
>
>> On Oct 29, 2015, at 8:12 PM, Jeff Ferland <j...@tubularlabs.com> wrote:
>>
>> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66.
>>
>> We upgraded our cluster this weekend and have been having issues with dropped mutations since then. Intense investigation of a single node and toying with settings has revealed that GC stalls don’t account for enough time to explain the 10 seconds of apparent stall that would cause a hangup.
>>
>> tpstats output typically shows active threads in the low single digits and pending similar or 0. Before a failure, pending MutationStage will skyrocket into 5+ digits. system.log regularly shows the gossiper complaining, then slow-log complaints, then dropped mutations.
>>
>> For the entire minute of 00:55, the GC log shows no single pause > 0.14 seconds, and most of them much smaller.
>> Abbreviated GC log after switching to G1GC (the problem was also exhibited before G1GC):
>>
>> 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) (young)
>>
>> System log snippet of the pattern I’m seeing:
>>
>> WARN [GossipTasks:1] 2015-10-30 00:55:25,129 Gossiper.java:747 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
>> INFO [CompactionExecutor:210] 2015-10-30 00:55:26,006 CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'),
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'),
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'),
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'),
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
>> WARN [GossipTasks:1] 2015-10-30 00:55:26,230 Gossiper.java:747 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
>> WARN [GossipTasks:1] 2015-10-30 00:55:27,330 Gossiper.java:747 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
>> WARN [GossipTasks:1] 2015-10-30 00:55:28,431 Gossiper.java:747 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
>> WARN [GossipTasks:1] 2015-10-30 00:55:29,531 Gossiper.java:747 - Gossip stage has 10 pending tasks; skipping status check (no nodes will be marked down)
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448 CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451 CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453 CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454 CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455 CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457 CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458 CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459 CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in slow log
>> INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461 CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in slow log
>> WARN [GossipTasks:1] 2015-10-30 00:55:32,697 FailureDetector.java:257 - Not marking nodes down due to local pause of 8668629909 > 5000000000
>> INFO [CompactionExecutor:212] 2015-10-30 00:55:32,864 ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 (0%) on-heap, 0 (0%) off-heap
>> INFO [MemtableFlushWriter:846] 2015-10-30 00:55:32,865 Memtable.java:347 - Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
>> INFO [MemtableFlushWriter:846] 2015-10-30 00:55:32,865 Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, position=8699402)
>> INFO [ScheduledTasks:1] 2015-10-30 00:55:35,346 MessagingService.java:930 - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout and 0 for cross node timeout
>> INFO [ScheduledTasks:1] 2015-10-30 00:55:35,350 MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout and 0 for cross node timeout
>>
>> I’m starting to suspect I’ll have to take regular stack dumps and catch a bad moment, but I figured I’d reach out here before going down that road to see if somebody else had ever experienced a similar kind of pain.
>>
>> -Jeff
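The "no pause > 0.14 s" claim can be double-checked against the full gc.log rather than an abbreviated listing. A sketch, assuming pause lines end with the usual `[Times: user=… sys=… real=X secs]` summary; the two sample lines below are invented for illustration, since the listing above is truncated:

```shell
# Find the longest stop-the-world pause ("real=" wall-clock seconds) in a
# G1 GC log. A two-line canned sample stands in for the real gc.log.
grep -o 'real=[0-9.]*' <<'EOF' | cut -d= -f2 | sort -n | tail -1
6752.857: [GC pause (G1 Evacuation Pause) (young), 0.1201 secs] [Times: user=0.40 sys=0.02, real=0.12 secs]
6755.150: [GC pause (GCLocker Initiated GC) (young), 0.0903 secs] [Times: user=0.31 sys=0.01, real=0.09 secs]
EOF
```

If the maximum this reports stays far below the timeout, GC is ruled out and the local-pause warning from the FailureDetector points at something outside the collector (safepoints, swap, blocked I/O).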
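As a closing note for anyone triaging the same pattern: the per-5-second drop counters in the MessagingService lines above can be totalled to size the problem over time. A sketch; the two sample lines are copied from Jeff's snippet, and in practice you would feed in system.log:

```shell
# Sum the internal-timeout drop counts from MessagingService log lines.
# The count is the token immediately after "ms:".
awk '/messages were dropped/ { for (i = 1; i <= NF; i++) if ($i == "ms:") sum += $(i + 1) }
     END { print sum + 0 }' <<'EOF'
INFO [ScheduledTasks:1] 2015-10-30 00:55:35,346 MessagingService.java:930 - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2015-10-30 00:55:35,350 MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout and 0 for cross node timeout
EOF
```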