Regarding the "ghost IP", you may want to check the system.peers_v2
table by doing "select * from system.peers_v2 where peer =
'123.456.789.012';"
I've seen this (non-)issue many times, and I had to do "delete from
system.peers_v2 where peer=..." to fix it, as on our client side, the
Python cassandra-driver, reads the token ring information from this
table and uses it for routing requests.
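For reference, a minimal cqlsh sketch of that check-and-cleanup (the
address is the placeholder from above, and system.peers_v2 is local to
each node, so this has to be run on every node that still shows the
ghost entry):

    -- check whether this node still advertises the ghost peer
    SELECT peer, data_center, host_id, tokens FROM system.peers_v2 WHERE peer = '123.456.789.012';
    -- if the row is stale, remove it so drivers stop using it for routing
    DELETE FROM system.peers_v2 WHERE peer = '123.456.789.012';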
On 07/06/2022 05:22, Gil Ganz wrote:
The only errors I see in the logs prior to the gossip pending issue are
things like this:
INFO [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833 NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: /Y:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host
    at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
The remote ip mentioned here is an ip that appears in the seed list
(there are 20 other valid ip addresses in the seeds clause), but it's
no longer a valid ip: it's an old ip of an existing server (it's not
in the peers table). I will try to reproduce the issue with this
ip removed from the seed list.
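For context, that seed list is the one configured under seed_provider
in cassandra.yaml; a rough sketch of the change (placeholder addresses,
not the real ones):

    # cassandra.yaml (sketch)
    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          # the stale ip has been dropped from this comma-separated list
          - seeds: "10.10.1.1,10.10.1.2,10.10.1.3"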
On Mon, Jun 6, 2022 at 9:39 PM C. Scott Andreas <sc...@paradoxica.net>
wrote:
Hi Gil, thanks for reaching out.
Can you check Cassandra's logs to see if any uncaught exceptions
are being thrown? What you described suggests the possibility of
an uncaught exception being thrown in the Gossiper thread,
preventing further tasks from making progress; however, I'm not
aware of any open issues in 4.0.4 that would result in this.
Would be eager to investigate immediately if so.
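A quick way to scan for that, assuming the default package-install
log location (both the path and the exact pattern are assumptions):

    # look for uncaught exceptions around the time gossip tasks started piling up
    grep -E 'ERROR|Uncaught' /var/log/cassandra/system.log | less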
– Scott
On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:
Hey
We have a big cluster (>500 nodes, on-prem, multiple datacenters,
most with vnodes=32, but some with 128) that was recently
upgraded from 3.11.9 to 4.0.4. Servers are all CentOS 7.
We have been dealing with a few gossip-related issues since:
1 - The moment the last node in the cluster came up with 4.0.4
and all nodes were on the same version, gossip pending tasks
started to climb to very high numbers (>1M) on all nodes in the
cluster, and the cluster quickly became practically unusable. It took
us a few hours of stopping/starting nodes, and adding more nodes to
the seed list, to finally get the cluster back up.
2 - We notice that pending gossip tasks climb to very high
numbers (50k) on random nodes in the cluster, without any
meaningful event having happened, and it doesn't look like they will
go down on their own (see the sketch after this list for how we watch
this). After a few hours we restart those nodes and the count goes back to 0.
3 - Doing a rolling restart of a list of servers is now an issue:
more often than not, one of the nodes we restart comes up with
gossip issues, and we need a second restart to get the gossip
pending tasks back to 0.
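For completeness, a sketch of how those pending counts can be watched
per node; the exact command is an assumption, but the GossipStage pool
in nodetool tpstats is the figure referred to above:

    # "Pending" for the GossipStage pool is the gossip pending tasks count
    nodetool tpstats | grep -E 'Pool Name|GossipStage'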
Is there a known issue related to gossip in big clusters, in
recent versions?
Is there any tuning that can be done?
Just to give a sense of how big the gossip information in this
cluster is: "nodetool gossipinfo" output size is ~300 KB.
gil