Regarding the "ghost IP", you may want to check the system.peers_v2 table by running "select * from system.peers_v2 where peer = '10.0.0.99';" (substituting the ghost node's actual address).

I've seen this (non-)issue many times, and I had to run "delete from system.peers_v2 where peer=..." to fix it, because on our client side the Python cassandra-driver reads the token ring information from this table and uses it for routing requests.
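
A rough sketch of that check-and-clean-up from a shell on an affected node (10.0.0.99 is just a placeholder for the ghost IP; verify the row really is stale before deleting anything, and repeat on every node that still shows it):

    # confirm the stale entry is present in this node's peer table
    cqlsh -e "select peer, host_id, tokens from system.peers_v2 where peer = '10.0.0.99';"

    # if it is a leftover entry that no live node owns, remove it
    cqlsh -e "delete from system.peers_v2 where peer = '10.0.0.99';"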

On 07/06/2022 05:22, Gil Ganz wrote:
The only errors I see in the logs prior to the gossip pending issue are things like this:

INFO  [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833 NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: /Y:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

The remote ip mentioned here is one that appears in the seed list (there are 20 other valid ip addresses in the seeds clause), but it's no longer a valid ip: it's an old ip of an existing server, and it's not in the peers table. I will try to reproduce the issue with this ip removed from the seed list.
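
For reference, the change is just dropping that stale address from the seeds line of cassandra.yaml on each node, roughly like this (addresses here are placeholders, not our real seed list):

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              # keep only addresses that currently belong to live nodes
              - seeds: "10.0.1.1,10.0.1.2,10.0.2.1"

and then restarting the nodes (or running nodetool reloadseeds) so they pick up the new list.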


On Mon, Jun 6, 2022 at 9:39 PM C. Scott Andreas <sc...@paradoxica.net> wrote:

    Hi Gil, thanks for reaching out.

    Can you check Cassandra's logs to see if any uncaught exceptions
    are being thrown? What you described suggests the possibility of
    an uncaught exception being thrown in the Gossiper thread,
    preventing further tasks from making progress; however I'm not
    aware of any open issues in 4.0.4 that would result in this.

    Would be eager to investigate immediately if so.

    – Scott

    On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:


    Hey
    We have a big cluster (>500 nodes, on-prem, multiple datacenters,
    most with vnodes=32 but some with 128) that was recently upgraded
    from 3.11.9 to 4.0.4. Servers are all CentOS 7.

    We have been dealing with a few issues related to gossip since then:
    1 - The moment the last node in the cluster was up on 4.0.4 and all
    nodes were on the same version, gossip pending tasks started to
    climb to very high numbers (>1M) on all nodes in the cluster, and
    the cluster was quickly practically down. It took us a few hours of
    stopping/starting nodes, and adding more nodes to the seed list, to
    finally get the cluster back up.
    2 - We notice pending gossip tasks climbing to very high numbers
    (50k) on random nodes in the cluster, without any meaningful event
    having happened, and the count doesn't look like it will go down on
    its own (see the quick check sketched after this list). After a few
    hours we restart those nodes and it goes back to 0.
    3 - Doing a rolling restart across a list of servers is now an
    issue: more often than not, one of the nodes we restart comes back
    up with gossip issues, and we need a 2nd restart to get the gossip
    pending tasks back to 0.
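
    In case it's useful: by gossip pending tasks I mean the GossipStage
    pending-task count, e.g. what shows up in something like:

        # GossipStage line of the thread-pool stats (Active / Pending / Completed / ...)
        nodetool tpstats | grep -i gossipstage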

    Is there a known issue related to gossip in big clusters, in
    recent versions?
    Is there any tuning that can be done?

    Just to give a sense of how big the gossip information is in this
    cluster, "nodetool gossipinfo" output size is ~300kb

    gil
