On 3.11, fat client timeout is QUARANTINE_DELAY / 2 : https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/gms/Gossiper.java#L260
Quarantine delay is StorageService.RING_DELAY * 2:
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/gms/Gossiper.java#L105

Ring delay is:

    private static int getRingDelay()
    {
        String newdelay = System.getProperty("cassandra.ring_delay_ms");
        if (newdelay != null)
        {
            logger.info("Overriding RING_DELAY to {}ms", newdelay);
            return Integer.parseInt(newdelay);
        }
        else
            return 30 * 1000;
    }

So the fat client timeout is 30s by default (RING_DELAY * 2 / 2), or whatever you pass with -Dcassandra.ring_delay_ms= on the command line. Note that this ALSO impacts normal startup/shutdown/expand/shrink/etc type operations, and if you have to ask how to change it, you probably shouldn't.

- Jeff

On Wed, Mar 17, 2021 at 8:32 AM Regis Le Bretonnic <r.lebreton...@meetic-corp.com> wrote:

> Hi Jeff
>
> Thanks a lot for your answer.
> The reference to "fat client" is very interesting… In the debug log on a
> classical node, we sometimes have messages like:
>
> INFO [GossipTasks:1] 2021-03-17 16:21:01,135 Gossiper.java:894 - FatClient /10.120.1.183 has been silent for 30000ms, removing from gossip
>
> Does it mean that the fat client (our proxies) is removed from gossip
> only after 30 seconds? In that case, the delay I am asking for is 30 sec :-)
> Does someone know if this parameter can be changed?
>
> PS: yes, proxies work really well… we indeed use PHP with FPM. That is the
> reason why we have a lot of connections and so need proxies. Basically,
> counting all FPM on all our PHP servers, I'd say 8000 to 10000 clients…
> maybe more.
> The advantages are multiple, but basically we had a lot of pressure on the
> Cassandra nodes when restarting all our PHP servers during a new code
> rollout requiring a PHP reload, for instance (many times per day). Proxies
> saved us. We can continue to talk in private if you want.
>
> *From:* Jeff Jirsa <jji...@gmail.com>
> *Sent:* Wednesday, March 17, 2021 14:52
> *To:* cassandra <user@cassandra.apache.org>
> *Subject:* Re: Delay between stop/start cassandra
>
> -Dcassandra.join_ring=false is basically a pre-bootstrap phase that says
> "this machine is about to join the cluster, but hasn't yet, so don't give
> it a token"
>
> It's taking advantage of a stable but non-terminal state to let you do
> things like serve queries without owning data - it's a side effect that
> works, but it's rough because it wasn't exactly built for this purpose. In
> this state, you're considered a "fat client" - your presence exists in the
> ring as a "I'm here, about to join the ring with IP a.b.c.d", and you just
> conveniently decide not to join the ring. If you go away at any time, the
> cluster says "cool, no big deal, they didn't join the ring anyway".
>
> Your hypothesis is probably mostly right here - it's not so much UP or
> DOWN, it's "still here" or "gone". Because once the instance is DOWN, it
> gets removed because it hadn't finished joining. Once it's removed, it can
> come back and say "Hi, me again, about to join this cluster". But, until
> it's removed as a fat client, when it comes back and says "Hi, me again,
> about to join this cluster", cassandra says "not so fast friend, you're
> already here and we haven't yet given up on you joining".
>
> Random aside: There are relatively few people on earth who run like this,
> so I'm super interested in knowing how it's working for you. Does the PHP
> client still reconnect on every page load, or does it finally support long
> lived connections / pooling if you're using something like php-fpm or a
> fastcgi pool? Are the coordinators/proxies here just to handle a ridiculous
> number of clients, or is it the cost of connecting that's hurting as you
> blow up the native thread pool on connect for expensive auth?
>
> On Wed, Mar 17, 2021 at 5:44 AM Regis Le Bretonnic <r.lebreton...@meetic-corp.com> wrote:
>
> Hi all,
>
> Following a discussion with our sysadmins, I have a very practical question.
>
> We use cassandra proxies (-Dcassandra.join_ring=false) as coordinators for
> PHP clients (a loooooot of PHP clients).
>
> Our problem is that restarting Cassandra on proxies sometimes fails with
> the following error:
>
> ERROR [main] 2021-03-16 14:18:46,236 CassandraDaemon.java:803 - Exception encountered during startup
> java.lang.RuntimeException: A node with address XXXXXXXXXXXXXXXX/10.120.1.XXX already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
>
> The node mentioned in the ERROR is the one we are restarting… and the
> start fails. Of course, doing a manual start afterwards works fine.
>
> This message doesn't make sense… the hostId didn't change for this proxy (I
> am sure of it: system.local, IP, hostname, … nothing changed… just the
> restart).
>
> What I suppose (we don't all agree about this) is that, as proxies don't
> have data, they start very quickly. Too quickly for the gossip protocol to
> know that the node was down.
>
> Could this ERROR log be explained if the node is still known as UP by the
> seed servers, because the state of the proxy in the gossip protocol is not
> updated when the stop/start is made too quickly?
>
> If this hypothesis seems possible, what reasonable delay (with technical
> arguments) should be implemented between stop and start?
> We have ~ 100 proxies and 12 classical Cassandra nodes (4 of them are seeds)…
>
> Thx in advance
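
The arithmetic discussed in this thread can be sketched in Java as below. This is only an illustration under the 3.11 definitions quoted above (ring delay defaults to 30 * 1000 ms, quarantine delay is ring delay * 2, fat client timeout is quarantine delay / 2). The class name, the method names, and the extra safety margin are hypothetical and not part of Cassandra; in particular, the margin is an arbitrary assumption, since the thread does not say how long after the timeout the removal is actually visible on every node.

```java
// Sketch only: mirrors the arithmetic described in this thread for Cassandra 3.11.
// FatClientTimeout and all of its methods are hypothetical names, not Cassandra APIs.
class FatClientTimeout {
    // Default taken from the getRingDelay() snippet quoted in the thread.
    static final int DEFAULT_RING_DELAY_MS = 30 * 1000;

    // Ring delay: the cassandra.ring_delay_ms override if present, otherwise 30 s.
    static int ringDelayMs(String override) {
        return override != null ? Integer.parseInt(override) : DEFAULT_RING_DELAY_MS;
    }

    // Quarantine delay is RING_DELAY * 2.
    static int quarantineDelayMs(int ringDelayMs) {
        return ringDelayMs * 2;
    }

    // Fat client timeout is QUARANTINE_DELAY / 2.
    static int fatClientTimeoutMs(int ringDelayMs) {
        return quarantineDelayMs(ringDelayMs) / 2;
    }

    // Assumption: wait for the fat client timeout plus a caller-chosen margin
    // between stopping and starting a join_ring=false proxy.
    static long minRestartWaitMs(int ringDelayMs, long marginMs) {
        return fatClientTimeoutMs(ringDelayMs) + marginMs;
    }

    public static void main(String[] args) {
        int ringDelay = ringDelayMs(System.getProperty("cassandra.ring_delay_ms"));
        System.out.println("ring delay (ms):         " + ringDelay);
        System.out.println("quarantine delay (ms):   " + quarantineDelayMs(ringDelay));
        System.out.println("fat client timeout (ms): " + fatClientTimeoutMs(ringDelay));
        // The 5000 ms margin is an arbitrary illustrative value.
        System.out.println("restart wait (ms):       " + minRestartWaitMs(ringDelay, 5000));
    }
}
```

With no override, the fat client timeout evaluates to 30000 ms, which matches the "has been silent for 30000ms" message in the log quoted above. A restart script for the proxies could sleep for at least that long between stop and start instead of changing ring_delay_ms.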