Got just enough time to look at this today to verify that:
Sometimes nodes (under pressure) fail to send heartbeats for long
enough to get marked as dead by other nodes. Why is a good question,
which I need to check more closely; it does not seem to be GC.
The node does however start sending
I bet the problem is with the other tasks on the executor that Gossip
heartbeat runs on.
I see at least two that could cause blocking: hint cleanup
post-delivery and flush-expired-memtables. Both call forceFlush,
which will block if the flush queue and the flush threads are all full.
We've run into this
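
To illustrate the suspected mechanism, here is a minimal sketch, not Cassandra code: a periodic "heartbeat" shares a single-threaded scheduled executor with one task that blocks. The class name, method name, and timings are all made up for the illustration, and Thread.sleep merely stands in for a forceFlush call waiting on a full flush queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it does not use any Cassandra types.
class HeartbeatStarvationDemo {

    /** Returns the largest gap, in milliseconds, between consecutive heartbeats. */
    static long maxHeartbeatGapMillis(long heartbeatIntervalMs,
                                      long blockingTaskMs,
                                      long runForMs) throws InterruptedException {
        // One thread serves every task, so a blocked task stalls all the others.
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        List<Long> beats = Collections.synchronizedList(new ArrayList<>());

        // The "heartbeat": record a timestamp each time it gets to run.
        executor.scheduleWithFixedDelay(
                () -> beats.add(System.nanoTime()),
                0, heartbeatIntervalMs, TimeUnit.MILLISECONDS);

        // Stand-in for a task stuck in a blocking flush (assumption: sleep models the wait).
        executor.schedule(() -> {
            try {
                Thread.sleep(blockingTaskMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, heartbeatIntervalMs * 3, TimeUnit.MILLISECONDS);

        Thread.sleep(runForMs);
        executor.shutdownNow();
        executor.awaitTermination(1, TimeUnit.SECONDS);

        // Find the widest interval between two consecutive heartbeats.
        long maxGapNanos = 0;
        synchronized (beats) {
            for (int i = 1; i < beats.size(); i++) {
                maxGapNanos = Math.max(maxGapNanos, beats.get(i) - beats.get(i - 1));
            }
        }
        return TimeUnit.NANOSECONDS.toMillis(maxGapNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        // 50 ms heartbeat, one task that blocks for 500 ms, observed for 1.2 s.
        System.out.println("max heartbeat gap (ms): " + maxHeartbeatGapMillis(50, 500, 1200));
    }
}
```

With these numbers the largest gap should come out at roughly the blocking time rather than the heartbeat interval, which is the same shape as a node going quiet long enough to be marked down and then resuming. Whether this is what actually happens on the gossip executor still needs to be confirmed.
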
World as seen from .81 in the ring below:
.81  Up    Normal  85.55 GB  8.33%  Token(bytes[30])
.82  Down  Normal  83.23 GB  8.33%  Token(bytes[313230])
.83  Up    Normal  70.43 GB  8.33%  Token(bytes[313437])
.84  Up    Normal  81.7 GB   8.33%