Re: Cassandra stalls and dropped messages not due to GC

2015-11-02 Thread Jeff Ferland
Having caught a node in an undesirable state, many of my threads are reading like this:

"SharedPool-Worker-5" #875 daemon prio=5 os_prio=0 tid=0x7f3e14196800 nid=0x96ce waiting on condition [0x7f3ddb835000]
   java.lang.Thread.State: WAITING (parking)
   at

Re: Cassandra stalls and dropped messages not due to GC

2015-11-02 Thread Nate McCall
> Forgive me, but what is CMS?

Sorry - ConcurrentMarkSweep garbage collector.

> No. I’ve tried some mitigations since tuning thread pool sizes and GC, but
> the problem begins with only an upgrade of Cassandra. No other system
> packages, kernels, etc.

From what 2.0 version did

Re: Cassandra stalls and dropped messages not due to GC

2015-11-02 Thread Jeff Ferland
> On Nov 2, 2015, at 11:35 AM, Nate McCall wrote:
> Forgive me, but what is CMS?
>
> Sorry - ConcurrentMarkSweep garbage collector.

Ah, my brain was trying to think in terms of something Cassandra specific. I have full GC logging on and since moving to G1, I haven’t

Re: Cassandra stalls and dropped messages not due to GC

2015-10-30 Thread Nate McCall
Does tpstats show unusually high counts for blocked flush writers? As Sebastian suggests, running ttop will paint a clearer picture about what is happening within C*. I would however recommend going back to CMS in this case as that is the devil we all know and more folks will be able to offer

Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Jeff Ferland
Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66

We upgraded our cluster this weekend and have been having issues with dropped mutations since then. Intensely investigating a single node and toying with settings has revealed that GC stalls don’t make up enough time to explain the 10 seconds

Re: Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Sebastian Estevez
The thing about the CASSANDRA-9504 theory is that it was solved in 2.1.6 and Jeff's running 2.1.11.

@Jeff How often does this happen? Can you watch ttop as soon as you notice increased read/write latencies?

wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
java

Re: Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Jeff Ferland
Upgraded from 2.0.x. Using the other commit log sync method and 10 seconds. Enabling batch mode is like swallowing a grenade. It’s starting to look to me like it’s possibly related to brief IO spikes that are smaller than my usual graphing granularity. It feels surprising to me that these

Re: Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Graham Sanderson
Only if you actually change cassandra.yaml (that was the change in 2.1.6 which is why it matters what version he upgraded from)

> On Oct 29, 2015, at 10:06 PM, Sebastian Estevez wrote:
>
> The thing about the CASSANDRA-9504 theory is that it was solved in

Re: Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Graham Sanderson
you didn’t say what you upgraded from, but if it is 2.0.x, then look at CASSANDRA-9504

If so and you use

commitlog_sync: batch

Then you probably want to set

commitlog_sync_batch_window_in_ms: 1 (or 2)

Note I’m only slightly convinced this is the cause because of your READ_REPAIR issues (though
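
For reference, a rough sketch of how the two commit log sync modes discussed in this thread might look in cassandra.yaml. This assumes the 2.1-era setting names. The periodic value is an assumption based on the "10 seconds" mentioned elsewhere in the thread, and the batch window is the value suggested above, not a verified recommendation for any particular cluster.

```yaml
# Periodic mode: writes are acknowledged immediately and the commit log
# is synced to disk every commitlog_sync_period_in_ms milliseconds.
# (10000 ms is assumed here from the "10 seconds" mentioned in the thread.)
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Batch mode (alternative; only one mode may be active at a time):
# writes are not acknowledged until the commit log has been synced.
# The window below is the value suggested in this message (1 or 2).
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2
```

Only one `commitlog_sync` mode should be uncommented at a time, and a node restart is needed for a change to take effect.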