The trouble with the CASSANDRA-9504 theory is that it was fixed in 2.1.6,
and Jeff is running 2.1.11.

@Jeff

How often does this happen? Can you watch ttop as soon as you notice
increased read/write latencies?

> wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
> java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU


This should at least tell you which Cassandra threads are responsible for
the high memory allocation and CPU consumption.
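
If the bad moments are hard to catch live, you can also leave ttop running
and save its output for later correlation with system.log, e.g. (same jar,
default JMX port assumed):

> java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU | tee "ttop-$(date +%Y%m%dT%H%M%S).log"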

All the best,


Sebastián Estévez

Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com


DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world's most innovative enterprises.
DataStax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the world's
most innovative companies such as Netflix, Adobe, Intuit, and eBay.

On Thu, Oct 29, 2015 at 9:36 PM, Graham Sanderson <gra...@vast.com> wrote:

> you didn’t say what you upgraded from, but if it is 2.0.x, then look at
> CASSANDRA-9504
>
> If so and you use
>
> commitlog_sync: batch
>
> Then you probably want to set
>
> commitlog_sync_batch_window_in_ms: 1 (or 2)
>
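> That is, in cassandra.yaml something like the following (window value
> illustrative; for comparison, the 2.1 default is commitlog_sync: periodic
> with commitlog_sync_period_in_ms: 10000):
>
> commitlog_sync: batch
> commitlog_sync_batch_window_in_ms: 2
> # commitlog_sync: periodic
> # commitlog_sync_period_in_ms: 10000
>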
> Note: I'm only slightly convinced this is the cause, because of your
> READ_REPAIR issues (though if you are dropping a lot of MUTATIONs under
> load and your machines are overloaded, you'd probably be doing more
> READ_REPAIR than usual anyway).
>
> On Oct 29, 2015, at 8:12 PM, Jeff Ferland <j...@tubularlabs.com> wrote:
>
> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66
>
> We upgraded our cluster this weekend and have been having issues with
> dropped mutations since then. Intensely investigating a single node and
> toying with settings has revealed that GC stalls don’t make up enough time
> to explain the 10 seconds of apparent stall that would cause a hangup.
>
> tpstats output typically shows active threads in the low single digits and
> pending similar or zero. Before a failure, pending MutationStage will
> skyrocket into 5+ digits. system.log regularly shows the gossiper
> complaining, then slow-query-log complaints, then dropped mutations.
>
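> For reference, sampling tpstats in a loop (interval arbitrary) makes that
> ramp-up easy to line up against system.log afterwards:
>
> while sleep 5; do echo "=== $(date -u)"; nodetool tpstats; done >> tpstats.log
>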
> For the entire minute of 00:55, the GC logging shows no single pause > 0.14
> seconds, and most are much smaller. Abbreviated GC log after switching to
> G1GC (the problem was also exhibited before G1GC):
>
> 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC)
> (young)
> 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC)
> (young)
> 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC)
> (young)
> 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC)
> (young)
> 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause)
> (young)
> 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC)
> (young)
>
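> For completeness, the total pause time for that minute can be summed
> straight from the full gc.log (assuming the stock -XX:+PrintGCDateStamps /
> -XX:+PrintGCDetails line format, where each pause line ends in
> ", N.NNNNNNN secs]"):
>
> grep '^2015-10-30T00:55.*GC pause' gc.log | grep -o '[0-9.]* secs' | awk '{s+=$1} END {print s " sec of pause"}'
>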
> System log snippet of the pattern I’m seeing:
>
> WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip
> stage has 1 pending tasks; skipping status check (no nodes will be marked
> down)
> INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006
>  CompactionTask.java:141 - Compacting
> [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'),
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'),
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'),
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'),
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
> WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip
> stage has 3 pending tasks; skipping status check (no nodes will be marked
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip
> stage has 5 pending tasks; skipping status check (no nodes will be marked
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip
> stage has 7 pending tasks; skipping status check (no nodes will be marked
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip
> stage has 10 pending tasks; skipping status check (no nodes will be marked
> down)
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461
>  CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in
> slow log
> WARN  [GossipTasks:1] 2015-10-30 00:55:32,697  FailureDetector.java:257 -
> Not marking nodes down due to local pause of 8668629909 > 5000000000
> INFO  [CompactionExecutor:212] 2015-10-30 00:55:32,864
>  ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress:
> 164 (0%) on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:347
> - Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized
> bytes, 1 ops, 0%/0% of on/off-heap limit)
> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:382
> - Completed flushing
> /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db
> (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729,
> position=8699402)
> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,346
>  MessagingService.java:930 - MUTATION messages were dropped in last 5000
> ms: 67432 for internal timeout and 0 for cross node timeout
> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,350
>  MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000
> ms: 209 for internal timeout and 0 for cross node timeout
>
> I’m starting to suspect I’ll have to take regular stack dumps and catch a
> bad moment, but I figured I’d reach out here before going down that road to
> see if somebody else had ever experienced a similar kind of pain.
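>
> Something like the following on a timer would do it (the pgrep pattern is a
> guess; substitute the node's actual pid):
>
> while sleep 10; do jstack -l $(pgrep -f CassandraDaemon) > "jstack-$(date +%H%M%S).txt"; done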
>
> -Jeff
>
>
>
