Also, I see this issue only when I have more column families. It looks like it scales with the combination of number of vnodes * number of CFs. Does anyone have any idea about this?
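For a sense of scale, here is a rough back-of-the-envelope sketch using the node, CF and vnode counts quoted further down in this thread. It is not how Cassandra accounts for memory: the per-definition size (ASSUMED_BYTES_PER_CF_DEFINITION) is a made-up placeholder, and the function name is hypothetical. It only shows how quickly the products grow if every peer answers a schema pull with the full schema at the same time.

```python
# Figures quoted in this thread.
NODES = 72
COLUMN_FAMILIES = 3000
VNODES_PER_NODE = 256

# ASSUMPTION, not measured: average heap cost of one deserialized
# column family definition. Replace with a value from a heap dump.
ASSUMED_BYTES_PER_CF_DEFINITION = 10 * 1024


def schema_pull_estimate(peers, column_families, bytes_per_definition):
    """Return (definitions, bytes) if every peer sends a full schema at once."""
    definitions = peers * column_families
    return definitions, definitions * bytes_per_definition


definitions, heap_bytes = schema_pull_estimate(
    NODES, COLUMN_FAMILIES, ASSUMED_BYTES_PER_CF_DEFINITION
)
vnode_cf_pairs = VNODES_PER_NODE * COLUMN_FAMILIES

print(f"vnode x CF combinations per node: {vnode_cf_pairs:,}")
print(f"CF definitions in flight: {definitions:,}")
print(f"estimated heap for those definitions: {heap_bytes / 2**30:.2f} GiB")
```

Both products grow linearly with the CF count, so this does not tell the two explanations apart. Measuring the real per-definition cost from a heap dump would.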
On Tue, Oct 23, 2018 at 9:48 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

> Did anyone run into similar issues?
>
> On Thu, Sep 6, 2018 at 10:27 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>
>> Here is the stacktrace from the failure, it looks like it's trying to
>> gather all the column family metrics and going OOM. Is this just for
>> the JMX metrics?
>>
>> https://github.com/apache/cassandra/blob/cassandra-2.1.16/src/java/org/apache/cassandra/metrics/ColumnFamilyMetrics.java
>>
>> ERROR [MessagingService-Incoming-/10.133.33.57] 2018-09-06 15:43:19,280 CassandraDaemon.java:231 - Exception in thread Thread[MessagingService-Incoming-/x.x.x.x,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>>     at java.io.DataInputStream.<init>(DataInputStream.java:58) ~[na:1.8.0_151]
>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:139) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.16.jar:2.1.16]
>> ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread Thread[InternalResponseStage:1,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.cassandra.metrics.ColumnFamilyMetrics$AllColumnFamilyMetricNameFactory.createMetricName(*ColumnFamilyMetrics.java:784*) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.metrics.ColumnFamilyMetrics.createColumnFamilyHistogram(ColumnFamilyMetrics.java:716) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.metrics.ColumnFamilyMetrics.<init>(ColumnFamilyMetrics.java:597) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:361) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:527) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:498) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:385) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:75) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:54) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.16.jar:2.1.16]
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
>>     at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]
>>
>> On Thu, Aug 30, 2018 at 12:51 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>
>>> thank you
>>>
>>> On Thu, Aug 30, 2018 at 11:58 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> This is the closest JIRA that comes to mind (from memory, I didn't
>>>> search, there may be others):
>>>> https://issues.apache.org/jira/browse/CASSANDRA-8150
>>>>
>>>> The best blog that's all in one place on tuning GC in cassandra is
>>>> actually Amy's 2.1 tuning guide:
>>>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html -
>>>> it's somewhat out of date as it's for 2.1, but since that's what
>>>> you're running, that works out in your favor.
>>>>
>>>> On Thu, Aug 30, 2018 at 10:53 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Is there any JIRA that talks about whether increasing the heap
>>>>> will help? Also, are there any alternatives to increasing the heap
>>>>> size? Last time I tried increasing the heap, longer GC pauses
>>>>> caused more damage in terms of latencies during the pauses.
>>>>>
>>>>> On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>
>>>>>> okay, thank you
>>>>>>
>>>>>> On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>
>>>>>>> You’re seeing an OOM, not a socket error / timeout.
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Jirsa
>>>>>>>
>>>>>>> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>>>
>>>>>>> Jeff,
>>>>>>>
>>>>>>> any idea if this is somehow related to:
>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11840?
>>>>>>> Does increasing the value of streaming_socket_timeout_in_ms to a
>>>>>>> higher value help?
>>>>>>>
>>>>>>> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have 72 nodes in the cluster, across 8 datacenters. The
>>>>>>>> moment I try to increase the node count above 84 or so, the
>>>>>>>> issue starts.
>>>>>>>>
>>>>>>>> I am still using CMS Heap, assuming it will create more harm if
>>>>>>>> I increase the heap size beyond 8G (recommended).
>>>>>>>>
>>>>>>>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Given the size of your schema, you’re probably getting flooded
>>>>>>>>> with a bunch of huge schema mutations as it hops into gossip
>>>>>>>>> and tries to pull the schema from every host it sees. You say
>>>>>>>>> 8 DCs but you don’t say how many nodes - I’m guessing it’s a
>>>>>>>>> lot?
>>>>>>>>>
>>>>>>>>> This is something that’s incrementally better in 3.0, but a
>>>>>>>>> real proper fix has been talked about a few times -
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11748 and
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13569 for
>>>>>>>>> example
>>>>>>>>>
>>>>>>>>> In the short term, you may be able to work around this by
>>>>>>>>> increasing your heap size. If that doesn’t work, there’s an
>>>>>>>>> ugly ugly hack that’ll work on 2.1: limiting the number of
>>>>>>>>> schema blobs you can get at a time - in this case, that means
>>>>>>>>> firewall off all but a few nodes in your cluster for 10-30
>>>>>>>>> seconds, make sure it gets the schema (watch the logs or file
>>>>>>>>> system for the tables to be created), then remove the firewall
>>>>>>>>> so it can start the bootstrap process (it needs the schema to
>>>>>>>>> setup the streaming plan, and it needs all the hosts up in
>>>>>>>>> gossip to stream successfully, so this is an ugly hack to give
>>>>>>>>> you time to get the schema and then heal the cluster so it can
>>>>>>>>> bootstrap).
>>>>>>>>>
>>>>>>>>> Yea that’s awful. Hopefully either of the two above JIRAs
>>>>>>>>> lands to make this less awful.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Jirsa
>>>>>>>>>
>>>>>>>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> It fails before bootstrap
>>>>>>>>>
>>>>>>>>> streaming throughput on the nodes is set to 400 Mbps
>>>>>>>>>
>>>>>>>>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Is the bootstrap plan succeeding (does streaming start or
>>>>>>>>>> does it crash before it logs messages about streaming
>>>>>>>>>> starting)?
>>>>>>>>>>
>>>>>>>>>> Have you capped the stream throughput on the existing hosts?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>
>>>>>>>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello All,
>>>>>>>>>>
>>>>>>>>>> We are seeing an issue when we add more nodes to the cluster,
>>>>>>>>>> where the new node is not able to stream the entire metadata
>>>>>>>>>> and fails to bootstrap. Finally the process dies with OOM
>>>>>>>>>> (java.lang.OutOfMemoryError: Java heap space)
>>>>>>>>>>
>>>>>>>>>> But if I remove a few nodes from the cluster we don't see
>>>>>>>>>> this issue.
>>>>>>>>>>
>>>>>>>>>> Cassandra Version: 2.1.16
>>>>>>>>>> # of KS and CF : 100, 3000 (approx)
>>>>>>>>>> # of DC: 8
>>>>>>>>>> # of Vnodes per node: 256
>>>>>>>>>>
>>>>>>>>>> Not sure what is causing this behavior, has anyone come
>>>>>>>>>> across this scenario?
>>>>>>>>>> thanks in advance.