Hi all,

I just updated the ticket. It turned out that libjemalloc was segfaulting
the JVM. The Java version does not seem to matter (I tried updating it, with
no improvement); newer C* versions seem to be affected, maybe because they
preload libjemalloc by default.

Cheers,
Stefano

On Thu, May 5, 2016 at 5:01 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
> Hi,
>
> I am experiencing some weird behaviors after upgrading 2 nodes (out of 13)
> to C* 3.0.5 (from 2.1.11). Basically, after restarting a second time, there
> is a small chance that the node will die without outputting anything to the
> logs (not even dmesg).
>
> This happened on both nodes I upgraded. The only "anomalies" I see in the
> logs (although not related to the moment a node dies) are:
>
> * Lots of the following messages against all IPs of the cluster (every
>   second)
>
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2540341017 for /x.y.b.5
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2000551507 for /x.y.a.7
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2000479104 for /x.y.a.3
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2000471247 for /x.y.b.3
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,259 FailureDetector.java:456 - Ignoring interval time of 2000605748 for /x.y.a.5
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 - Ignoring interval time of 2000731307 for /x.y.b.6
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 - Ignoring interval time of 3000404107 for /x.y.b.1
>
> * Some metrics are not being pushed to graphite (but some do get to the
>   server). Also, every time the node tries to push them I can see the
>   following error in the logs:
>
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-05 23:53:37,770 ScheduledReporter.java:119 - RuntimeException thrown from GraphiteReporter#report. Exception was suppressed.
> java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
>     at org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231) ~[apache-cassandra-3.0.5.jar:3.0.5]
>     at org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103) ~[apache-cassandra-3.0.5.jar:3.0.5]
>     at com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:252) ~[metrics-graphite-3.1.0.jar:3.1.0]
>     at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:166) ~[metrics-graphite-3.1.0.jar:3.1.0]
>     at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.0.jar:3.1.0]
>     at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) ~[metrics-core-3.1.0.jar:3.1.0]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_60]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_60]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_60]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>
> Besides these, logs are clean. I've opened a ticket here
> (https://issues.apache.org/jira/browse/CASSANDRA-11723) but any help
> debugging this is more than welcome.
>
> Regards,
> Stefano