Re: Upgrade from 2.1.11 to 3.0.5 leads to unstable nodes

Stefano Ortolani Fri, 06 May 2016 17:47:39 -0700

Hi all,

I just updated the ticket: it turned out that libjemalloc was segfaulting the
JVM.
New C* versions seem to be affected regardless of the Java version (I tried
updating Java, with no improvement), maybe because they preload libjemalloc
by default.
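
For anyone who wants to check whether a running JVM actually has libjemalloc
mapped, here is a minimal sketch. It assumes a Linux host where
/proc/<pid>/maps is readable; the function names are made up for
illustration and are not part of Cassandra.

```python
import sys
from pathlib import Path


def jemalloc_libraries(maps_text):
    """Return the distinct jemalloc library paths listed in /proc/<pid>/maps text."""
    found = []
    for line in maps_text.splitlines():
        # Each line is: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if "jemalloc" in path and path not in found:
            found.append(path)
    return found


def jemalloc_in_process(pid):
    """Read the maps of a live process (Linux only) and list jemalloc libraries."""
    return jemalloc_libraries(Path(f"/proc/{pid}/maps").read_text())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_jemalloc.py <pid of the Cassandra JVM>
    for path in jemalloc_in_process(int(sys.argv[1])):
        print(path)
```

An empty result would suggest the library is not loaded in that process; it
says nothing about why a node crashed.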


Cheers,
Stefano

On Thu, May 5, 2016 at 5:01 PM, Stefano Ortolani <ostef...@gmail.com> wrote:

> Hi,
>
> I am seeing some weird behavior after upgrading 2 nodes (out of 13)
> to C* 3.0.5 (from 2.1.11). Basically, after restarting a second time, there
> is a small chance that the node will die without writing anything to the
> logs (not even dmesg).
>
> This happened on both nodes I upgraded. The only "anomalies" I see in the
> logs (although not related to the moment a node dies) are:
>
> * Lots of the following messages against all IPs of the cluster (every
> second)
>
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2540341017 for /x.y.b.5
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2000551507 for /x.y.a.7
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2000479104 for /x.y.a.3
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2000471247 for /x.y.b.3
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,259 FailureDetector.java:456 -
> Ignoring interval time of 2000605748 for /x.y.a.5
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 -
> Ignoring interval time of 2000731307 for /x.y.b.6
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 -
> Ignoring interval time of 3000404107 for /x.y.b.1
>
> * Some metrics are not being pushed to Graphite (although some do reach the
> server). Also, every time the node tries to push them I can see the
> following error in the logs:
>
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-05 23:53:37,770
> ScheduledReporter.java:119 - RuntimeException thrown from
> GraphiteReporter#report. Exception was suppressed.
> java.lang.IllegalStateException: Unable to compute ceiling for max when
> histogram overflowed
> at
> org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231)
> ~[apache-cassandra-3.0.5.jar:3.0.5]
> at
> org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103)
> ~[apache-cassandra-3.0.5.jar:3.0.5]
> at
> com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:252)
> ~[metrics-graphite-3.1.0.jar:3.1.0]
> at
> com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:166)
> ~[metrics-graphite-3.1.0.jar:3.1.0]
> at
> com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
> ~[metrics-core-3.1.0.jar:3.1.0]
> at
> com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117)
> ~[metrics-core-3.1.0.jar:3.1.0]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [na:1.8.0_60]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> [na:1.8.0_60]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> [na:1.8.0_60]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> [na:1.8.0_60]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_60]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_60]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>
> Besides these, logs are clean. I've opened a ticket here (
> https://issues.apache.org/jira/browse/CASSANDRA-11723) but any help
> debugging this is more than welcome.
>
> Regards,
> Stefano
>
>
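
The GossipStage lines quoted above are easier to read once the interval
values are converted to seconds. Below is a minimal sketch that pulls them
out of a log excerpt. It assumes the numbers are nanoseconds, which is my
reading of the values and not something the log states; the function name is
made up for illustration.

```python
import re

# Matches the message body; the peer address follows the slash.
IGNORED_INTERVAL = re.compile(r"Ignoring interval time of (\d+) for /(\S+)")


def parse_ignored_intervals(log_text):
    """Return (peer, seconds) pairs, assuming the logged value is in nanoseconds."""
    return [
        (peer, int(nanos) / 1e9)
        for nanos, peer in IGNORED_INTERVAL.findall(log_text)
    ]


if __name__ == "__main__":
    excerpt = (
        "Ignoring interval time of 2540341017 for /x.y.b.5\n"
        "Ignoring interval time of 3000404107 for /x.y.b.1\n"
    )
    for peer, seconds in parse_ignored_intervals(excerpt):
        print(f"{peer}: {seconds:.3f} s")
```

Under that assumption, every interval in the excerpt is just over 2 or 3
seconds.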
