If you’re correct that the issue you linked to is the bug you are hitting, then 
it was fixed in 3.11.3.  You may have no choice but to upgrade.  From the 
discussion, it doesn’t read as if any tuning tweak avoided the issue; only the 
patch fixed it.

If you do, I’d suggest going to at least 3.11.5.

Note that usable memory for a heap setting > 31 GB may not be what you think. At 
32 GB you cross the boundary where the JVM can no longer use compressed object 
pointers, so every object reference doubles in size, from 4 to 8 bytes.  The 
only way you really win is when an app has only a modest number of objects, but 
some of those objects have large non-object-granularity allocations, e.g. a few 
huge byte arrays.  C* does use some large buffers, but it also generates a lot 
of small objects.
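
To make the pointer-size point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the usual 64-bit HotSpot layout on Java 8 (12-byte header and 4-byte references with compressed pointers, 16-byte header and 8-byte references without, 8-byte alignment). The function name and the example object shape are made up for illustration, and real numbers depend on the JVM version and flags.

```python
def object_size(ref_fields, primitive_bytes, compressed_oops):
    """Approximate shallow size of one Java object, in bytes.

    Assumed layout (typical 64-bit HotSpot, not measured on any real node):
    12-byte header and 4-byte references with compressed pointers,
    16-byte header and 8-byte references without, rounded up to 8 bytes.
    """
    header = 12 if compressed_oops else 16
    ref_size = 4 if compressed_oops else 8
    raw = header + ref_fields * ref_size + primitive_bytes
    return (raw + 7) // 8 * 8  # round up to the 8-byte alignment


# Hypothetical small object: 4 reference fields plus one 8-byte long.
small = object_size(4, 8, compressed_oops=True)    # heap below the 32 GB boundary
large = object_size(4, 8, compressed_oops=False)   # heap at or above 32 GB

GIB = 1024 ** 3
objects_at_31g = 31 * GIB // small   # how many such objects fit in 31 GiB
objects_at_62g = 62 * GIB // large   # how many fit in 62 GiB

print(small, large)
print(objects_at_31g, objects_at_62g)
```

For this made-up shape the object grows from 40 to 56 bytes, so doubling the heap buys well under twice the number of small objects. A huge byte array, by contrast, is almost all payload and barely changes size.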

I’d consider TCP tunings a likely red herring in this, if you are correct about 
the leak.  Doesn’t mean you can’t have better settings per suggestions made, 
just that it seems like it could be a case of refining behavior on the 
periphery of the problem, not anything directly addressing it.


From: Surbhi Gupta <surbhi.gupt...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Saturday, May 9, 2020 at 11:51 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Bootstraping is failing

I tried changing the heap size from 31GB to 62GB on the bootstrapping node 
because I noticed that, about midway through bootstrapping, heap usage reached 
around 90% or more and the node just froze.
The behavior is still the same: it again reached the midway point, the heap 
again reached 90% or more, and the node froze. None of the nodetool commands 
return any output, and the other nodes removed this node from joining because 
they were not able to gossip with it.
We are on 3.11.0.

I took a heap dump while the node was at 90%+ utilization of the 62GB heap, 
opened the leak report, and found 3 leak suspects. Two of the three were as 
below:

1. The thread io.netty.util.concurrent.FastThreadLocalThread @ 0x7fbe9533bf98 
StreamReceiveTask:26 keeps local variables with total size 16,898,023,552 
(31.10%) bytes.
The memory is accumulated in one instance of 
"io.netty.util.Recycler$DefaultHandle[]" loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x7fb917c76dc8".

2. The thread io.netty.util.concurrent.FastThreadLocalThread @ 0x7fbb846fb800 
StreamReceiveTask:29 keeps local variables with total size 11,696,214,424 
(21.53%) bytes.
The memory is accumulated in one instance of 
"io.netty.util.Recycler$DefaultHandle[]" loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x7fb917c76dc8".

Am I getting hit by 
https://issues.apache.org/jira/browse/CASSANDRA-13929 ?
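
As a quick sanity check on the two leak suspects above, the arithmetic can be sketched in Python. The sizes and percentages are copied from the report; the variable names are just for illustration, and the "implied heap" figure is only an estimate derived from those percentages.

```python
# Figures copied from the two leak suspects in the heap dump report.
suspects = {
    "StreamReceiveTask:26": (16_898_023_552, 31.10),
    "StreamReceiveTask:29": (11_696_214_424, 21.53),
}

GIB = 1024 ** 3
total_bytes = sum(size for size, _ in suspects.values())
total_pct = sum(pct for _, pct in suspects.values())

# Each (size, percent) pair implies the size of the whole dumped heap.
implied_heap_gib = [size / (pct / 100) / GIB for size, pct in suspects.values()]

print(f"retained by both threads: {total_bytes / GIB:.1f} GiB ({total_pct:.2f}%)")
print("implied heap in dump (GiB):", [round(x, 1) for x in implied_heap_gib])
```

So the two StreamReceiveTask threads together account for a bit over half of the dumped heap, all of it in Recycler handle arrays.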

I haven't changed the TCP settings. My TCP keepalive values are higher than the 
recommended ones. What I want to understand is: how can TCP settings affect the 
bootstrapping process?

Thanks
Surbhi

On Thu, 7 May 2020 at 17:01, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
When we start the node, it starts bootstrapping automatically and re-streams 
all of the data again.  It does not resume.

On Thu, May 7, 2020 at 4:47 PM Adam Scott <adam.c.sc...@gmail.com> wrote:
I think you want to run `nodetool bootstrap resume` 
(https://cassandra.apache.org/doc/latest/tools/nodetool/bootstrap.html) 
to pick up where it last left off. Sorry for the late reply.


On Thu, May 7, 2020 at 2:22 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
So after a failed bootstrap, if we start Cassandra again on the new node, 
will it resume the bootstrap or will it start over?

On Thu, 7 May 2020 at 13:32, Adam Scott <adam.c.sc...@gmail.com> wrote:
I recommend it on all nodes.  This will eliminate that as a source of trouble 
further on down the road.


On Thu, May 7, 2020 at 1:30 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
streaming_socket_timeout_in_ms is set to 24 hours.
So should the TCP settings be changed only on the new bootstrapping node, or on 
all nodes?


On Thu, 7 May 2020 at 13:23, Adam Scott <adam.c.sc...@gmail.com> wrote:

Edit /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10

Then run sysctl -p to make the kernel reload the settings.

5 minutes (300 seconds) is probably too long.
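
For reference, a small Python sketch of what these three values mean in practice. On Linux, an idle connection gets its first keepalive probe after tcp_keepalive_time seconds, and a vanished peer is declared dead after tcp_keepalive_probes unanswered probes spaced tcp_keepalive_intvl seconds apart. The function name is made up for illustration.

```python
def dead_peer_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Worst-case seconds before Linux drops an idle connection whose peer vanished.

    The kernel waits keepalive_time seconds of idleness, then sends up to
    keepalive_probes probes spaced keepalive_intvl seconds apart.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


current = dead_peer_detection_seconds(300, 30, 9)    # values reported from /proc
suggested = dead_peer_detection_seconds(60, 10, 3)   # values proposed for sysctl.conf

print(current, suggested)
```

With the reported values that is 570 seconds, against 90 seconds with the suggested ones. The first probe also goes out after 60 seconds of idleness instead of 300, which is the part that matters if a firewall is silently dropping idle connections.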

On Thu, May 7, 2020 at 1:09 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_time
300
[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
30
[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

On Thu, 7 May 2020 at 12:32, Adam Scott <adam.c.sc...@gmail.com> wrote:
Maybe a firewall killing a connection?

What does the following show?
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes

On Thu, May 7, 2020 at 10:31 AM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
Hi,

We are trying to expand a datacenter by adding nodes, but when a node is 
bootstrapping it gets about halfway through and then fails with the error 
below. We increased the stream throughput from 200 to 400 for the second 
attempt, but it still failed. We are on 3.11.0, using G1GC with a 31GB heap.


ERROR [MessagingService-Incoming-/10.X.X.X] 2020-05-07 09:42:38,933 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.X.X.X,main]
java.io.IOError: java.io.EOFException: Stream ended prematurely
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:227) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:215) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:839) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:814) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.0.jar:3.11.0]
Caused by: java.io.EOFException: Stream ended prematurely
        at net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218) ~[lz4-1.3.0.jar:na]
        at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:150) ~[lz4-1.3.0.jar:na]
        at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117) ~[lz4-1.3.0.jar:na]
        at java.io.DataInputStream.readFully(DataInputStream.java:195) ~[na:1.8.0_242]
        at java.io.DataInputStream.readFully(DataInputStream.java:169) ~[na:1.8.0_242]
        at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:402) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:437) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:245) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.readComplexColumn(UnfilteredSerializer.java:665) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:606) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1242) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1197) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Columns.apply(Columns.java:377) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:600) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:475) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:431) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:222) ~[apache-cassandra-3.11.0.jar:3.11.0]
        ... 11 common frames omitted

Thanks
Surbhi
