Only if you actually change cassandra.yaml (that was the change in 2.1.6, which 
is why it matters what version he upgraded from).

> On Oct 29, 2015, at 10:06 PM, Sebastian Estevez 
> <sebastian.este...@datastax.com> wrote:
> 
> The thing about the CASSANDRA-9504 theory is that it was solved in 2.1.6 and 
> Jeff's running 2.1.11.
> 
> @Jeff
> 
> How often does this happen? Can you watch ttop as soon as you notice 
> increased read/write latencies?
> 
> wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
> java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU
> 
> This should at least tell you which Cassandra threads are causing the high 
> memory allocation and CPU consumption.
> 
> All the best,
> 
> Sebastián Estévez
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
> 
> On Thu, Oct 29, 2015 at 9:36 PM, Graham Sanderson <gra...@vast.com> wrote:
> you didn’t say what you upgraded from, but if it is 2.0.x, then look at 
> CASSANDRA-9504
> 
> If so, and you use
> commitlog_sync: batch
> then you probably want to set
> commitlog_sync_batch_window_in_ms: 1 (or 2)
> Note I’m only slightly convinced this is the cause, because of your 
> READ_REPAIR issues (though if you are dropping a lot of MUTATIONs under load 
> and your machines are overloaded, you’d probably be doing more READ_REPAIR 
> than usual).
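> For reference, a minimal sketch of how the two settings above would sit 
> together in cassandra.yaml (the 1–2 ms window is the suggestion above, not a 
> tuned value):

```yaml
# cassandra.yaml -- batch commit log sync (sketch; values are the
# suggestion from this thread, not a measured optimum)
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
```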
> 
>> On Oct 29, 2015, at 8:12 PM, Jeff Ferland <j...@tubularlabs.com> wrote:
>> 
>> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66
>> 
>> We upgraded our cluster this weekend and have been having issues with 
>> dropped mutations since then. Intensely investigating a single node and 
>> toying with settings has revealed that GC stalls don’t make up enough time 
>> to explain the 10 seconds of apparent stall that would cause a hangup.
>> 
>> tpstats output typically shows active threads in the low single digits and 
>> pending similar or 0. Before a failure, pending MutationStage will skyrocket 
>> into 5+ digits. system.log regularly shows the gossiper complaining, then 
>> slow log complaints, then dropped mutations.
>> 
>> For the entire minute of 00:55, the GC log shows no single pause > 0.14 
>> seconds, and most are much smaller. Abbreviated GC log after switching to 
>> G1GC (the problem was also exhibited before G1GC):
>> 
>> 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 
>> System log snippet of the pattern I’m seeing:
>> 
>> WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip 
>> stage has 1 pending tasks; skipping status check (no nodes will be marked 
>> down)
>> INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006  
>> CompactionTask.java:141 - Compacting 
>> [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'),
>>  
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'),
>>  
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'),
>>  
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'),
>>  
>> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
>> WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip 
>> stage has 3 pending tasks; skipping status check (no nodes will be marked 
>> down)
>> WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip 
>> stage has 5 pending tasks; skipping status check (no nodes will be marked 
>> down)
>> WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip 
>> stage has 7 pending tasks; skipping status check (no nodes will be marked 
>> down)
>> WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip 
>> stage has 10 pending tasks; skipping status check (no nodes will be marked 
>> down)
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in 
>> slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461  
>> CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in 
>> slow log
>> WARN  [GossipTasks:1] 2015-10-30 00:55:32,697  FailureDetector.java:257 - 
>> Not marking nodes down due to local pause of 8668629909 > 5000000000
>> INFO  [CompactionExecutor:212] 2015-10-30 00:55:32,864  
>> ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 
>> (0%) on-heap, 0 (0%) off-heap
>> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:347 - 
>> Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized 
>> bytes, 1 ops, 0%/0% of on/off-heap limit)
>> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:382 - 
>> Completed flushing 
>> /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db
>>  (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, 
>> position=8699402)
>> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,346  MessagingService.java:930 
>> - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout 
>> and 0 for cross node timeout
>> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,350  MessagingService.java:930 
>> - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal 
>> timeout and 0 for cross node timeout
>> 
>> I’m starting to suspect I’ll have to take regular stack dumps and catch a 
>> bad moment, but I figured I’d reach out here before going down that road to 
>> see if somebody else had ever experienced a similar kind of pain.
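>> A polling loop along these lines (a sketch; the threshold, output path, and 
>> poll interval are illustrative assumptions, and it assumes nodetool and 
>> jstack are on the PATH) could capture a dump automatically the moment 
>> MutationStage pending spikes:

```shell
#!/bin/sh
# Sketch: dump thread stacks when MutationStage pending crosses a threshold.
# THRESHOLD, OUT_DIR, and the 5-second poll interval are illustrative guesses.

# Extract the Pending column (3rd field) for a given pool from tpstats output.
parse_pending() {
  awk -v stage="$1" '$1 == stage {print $3}'
}

THRESHOLD=10000
OUT_DIR=/tmp/cassandra-stacks
mkdir -p "$OUT_DIR"

# Against a live node, something like:
# while true; do
#   pending=$(nodetool tpstats | parse_pending MutationStage)
#   if [ "${pending:-0}" -gt "$THRESHOLD" ]; then
#     jstack "$(pgrep -f CassandraDaemon | head -n1)" \
#       > "$OUT_DIR/stack-$(date +%Y%m%dT%H%M%S).txt"
#   fi
#   sleep 5
# done

# Demo of the parser on a sample tpstats line:
echo "MutationStage 2 15342 9137894 0 0" | parse_pending MutationStage  # -> 15342
```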
>> 
>> -Jeff
>> 
> 
> 
