Re: Cassandra stalls and dropped messages not due to GC

Nate McCall Fri, 30 Oct 2015 15:38:52 -0700

Does tpstats show unusually high counts for blocked flush writers?

As Sebastian suggests, running ttop will paint a clearer picture of what
is happening within C*. I would, however, recommend going back to CMS in
this case, as that is the devil we all know and more folks will be able to
offer advice on reading its output (and it removes a delta).



> It’s starting to look to me like it’s possibly related to brief IO spikes
> that are smaller than my usual graphing granularity. It feels surprising to
> me that these would affect the Gossip threads, but it’s the best current
> lead I have with my debugging right now. More to come when I learn it.
>

Probably not the case since this was a result of an upgrade, but I've seen
similar behavior on systems where some kernels had issues with irqbalance
doing the right thing and would end up parking most interrupts on CPU0
(like say for the disk and ethernet modules) regardless of the number of
cores. Check via 'cat /proc/interrupts' and make sure the interrupts are
spread out across the CPU cores. You can steer them off manually at
runtime if they are not spread out.
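
If eyeballing /proc/interrupts is awkward on a box with many cores, something
like the rough sketch below can total the counts per CPU. The function name
and the SAMPLE text are made up for illustration (the counts are not from any
real system), and it assumes the usual layout: a header row of CPU names, then
one row per interrupt source with one count column per CPU.

```python
# Illustrative sample in the usual /proc/interrupts layout (made-up numbers).
SAMPLE = """\
           CPU0       CPU1
  0:         30          0   IO-APIC-edge      timer
 24:     900000         10   PCI-MSI-edge      eth0
 25:     500000          5   PCI-MSI-edge      ahci
NMI:          3          4   Non-maskable interrupts
ERR:          0
"""


def per_cpu_interrupt_totals(text):
    """Sum the interrupt counts per CPU column (hypothetical helper)."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        _label, _, rest = line.partition(":")
        counts = rest.split()[:len(cpus)]
        # Skip rows without a full set of per-CPU counts (e.g. ERR, MIS).
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for i, count in enumerate(counts):
            totals[i] += int(count)
    return dict(zip(cpus, totals))


totals = per_cpu_interrupt_totals(SAMPLE)
for cpu, total in totals.items():
    print(cpu, total)
```

On a live node you would pass in open("/proc/interrupts").read() instead of
SAMPLE. The counts are cumulative since boot, so comparing two readings taken
a few seconds apart gives a better view of where interrupts are landing now.
If one CPU dominates, the affinity for a given IRQ can be changed at runtime
through /proc/irq/<irq number>/smp_affinity.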

Also, did you upgrade anything besides Cassandra?


-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
