You’ve written a very solid technical analysis already, and what you’re describing is not expected behavior in Cassandra 5.0.x. Based on the symptoms, stack trace, and the version‑gating logic you inspected, what you’re hitting is almost certainly a real bug in the 5.0.x SAI gossip serialization path — specifically in the pre‑convergence version detection logic.
Let me walk you through what’s actually happening, why your cluster gets stuck, and what the Cassandra community would typically recommend.
🧠 What’s Actually Going Wrong (Root Cause)
1. Cassandra 5.0.3 introduced a fix (CASSANDRA‑20058)
That fix added a compressed SAI index‑status gossip format to avoid blowing up the gossip payload size.
2. But the fix is gated on minVersion
The new format is only used when:
- Gossip has converged
- All nodes report version ≥ 5.0.3
3. During startup, minVersion is null
This is the key problem.
On a cold restart:
- Nodes have no gossip state yet
- minVersion is treated as unknown
- Cassandra falls back to the legacy (pre‑5.0.3) format
- That format serializes all SAI index metadata uncompressed
- With many keyspaces/tables/indexes, the payload becomes huge
- TypeSizes.sizeof() asserts because the payload exceeds the expected bounds
4. The node never joins gossip → gossip never converges → new format never activates
This creates a deadlock:
- The compact format activates only after gossip has converged
- Gossip can only converge once nodes join without tripping the oversized-payload assertion
This is exactly why only the first node starts — it has no peers to gossip with, so it never receives a large payload.
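To make the trap concrete, here is a standalone model of the gating pattern described above. All names are illustrative; the real gate lives in Cassandra's SAI gossip serialization path and does not use these identifiers:

```java
// Standalone model of the flawed version gate (illustrative names only;
// not the actual Cassandra 5.0.x identifiers).
public class SaiGossipGateSketch
{
    // null models "gossip has not converged; minimum peer version unknown".
    static Boolean allPeersAtLeast503 = null;

    static boolean useLegacyFormat()
    {
        // Unknown is treated the same as "some peer is older than 5.0.3",
        // so a cold-starting node always falls back to the legacy,
        // uncompressed SAI status format -- the payload whose size trips
        // the TypeSizes.sizeof() assertion on large schemas.
        return allPeersAtLeast503 == null || !allPeersAtLeast503;
    }

    public static void main(String[] args)
    {
        System.out.println(useLegacyFormat()); // prints true on a cold restart
    }
}
```

The circularity is visible in the model: allPeersAtLeast503 can only flip to true after gossip converges, but gossip cannot converge while useLegacyFormat() keeps producing oversized payloads.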
🟠 Is this expected behavior in 5.0.x?
No.
This is not expected, and your analysis is correct: the version‑gating logic is flawed for clusters with large SAI metadata sets.
The fix in 5.0.3 solved the symptom (payload size) but not the bootstrap ordering problem.
🔴 Is this a known limitation or bug?
It is not widely documented, but it is a bug.
You are essentially hitting a gossip‑serialization deadlock caused by:
- Large SAI metadata
- Legacy serialization path
- Version detection not initialized early enough
Your observation that the fix is “incomplete” is accurate — the gating logic should not depend on gossip convergence during bootstrap.
This should be escalated upstream.
🟢 Is your workaround valid?
Yes — forcing the compact format:
return false;
…is the correct behavior for any homogeneous 5.0.4+ cluster.
Your patch proves the issue is not data corruption, but incorrect version‑gating logic.
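In terms of the sketch above, your patch collapses the gate to a constant. A minimal rendering (the method name is still illustrative; your actual change edits Cassandra's SAI serialization code):

```java
// Patched variant of useLegacyFormat() from the sketch above:
// never select the legacy format. Safe only on a homogeneous cluster
// where every node already understands the compact format.
static boolean useLegacyFormat()
{
    return false; // always use the compact SAI gossip format
}
```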
🧭 Recommended Workarounds (Until Upstream Fix)
1. Force compact format during startup (your patch)
This is the cleanest workaround for now.
2. Start nodes one at a time with gossip disabled
Not ideal, but possible:
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
Then:
- Start node
- Wait for it to settle
- Enable join
- Repeat
This avoids large gossip payloads during the initial handshake; a minimal sequence is sketched below.
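The sketch assumes a systemd-managed install (service name and readiness check vary by environment). `nodetool join` is the standard counterpart to starting with join_ring=false:

```bash
# Run on each node, one at a time; wait for the previous node to join
# before starting the next.
export JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
sudo systemctl start cassandra   # process starts but stays out of the ring
# ...wait until the node has settled (watch the logs / nodetool info)...
nodetool join                    # explicitly join the ring
```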
3. Reduce SAI index metadata temporarily
If possible:
- Drop unused SAI indexes
- Reduce index count per table
- Restart cluster
- Recreate indexes
Not ideal, but it works in emergencies; an example is sketched below.
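For example, with hypothetical keyspace, table, index, and column names (which indexes are worth dropping depends on your schema):

```bash
# 1. Drop the heaviest SAI indexes to shrink the gossip payload.
cqlsh -e "DROP INDEX my_ks.my_sai_idx;"
# 2. Restart the cluster while the payload is small.
# 3. Recreate the indexes afterwards (Cassandra 5.0 SAI syntax).
cqlsh -e "CREATE INDEX my_sai_idx ON my_ks.my_table (my_col) USING 'sai';"
```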
4. Patch the version‑gating logic upstream
The correct fix would be:
- If all nodes are ≥ 5.0.3 by configuration, use compact format immediately
- Do not rely on gossip convergence to activate the new format
This is effectively what your patch does, just hard-coded rather than configuration-driven; see the sketch below.
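A sketch of what that configuration-driven gate could look like, again on top of the model above; the system property is invented for illustration and does not exist in Cassandra:

```java
// Upstream-fix sketch: let operator configuration assert the minimum
// version, so bootstrap never has to wait for gossip convergence.
static boolean useLegacyFormat()
{
    // Hypothetical property: the operator promises all nodes are >= 5.0.3.
    if (Boolean.getBoolean("cassandra.sai.assume_min_version_5_0_3"))
        return false;

    // Otherwise keep the conservative gossip-based gate.
    return allPeersAtLeast503 == null || !allPeersAtLeast503;
}
```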
🧩 Why this only happens on restart (not initial provisioning)
During initial provisioning:
- Nodes join one by one
- Gossip state is small
- SAI metadata is not yet fully populated
- No large payloads are exchanged
After provisioning:
- SAI metadata grows
- Gossip payloads grow
- Legacy format becomes too large
- Restart triggers the deadlock
This is consistent with your observations.
🟦 What you should do next
If you want this fixed upstream, you should:
- Open a JIRA ticket referencing CASSANDRA‑20058
- Include your analysis + stack trace
- Include your patch
- Describe the bootstrap deadlock scenario
This is the kind of issue the Cassandra dev team will want to fix before 5.0.x becomes widely adopted.
🎯 Short Answers to Your Questions
Is this startup/version‑gating behavior expected?
No — this is unintended behavior.
Is this a known limitation or bug?
It is a bug. The fix in 5.0.3 is incomplete for large SAI metadata clusters.
Is there a recommended way to bootstrap or restart clusters in this state?
Workarounds include:
- Forcing compact format (best)
- Staggered startup with join_ring=false
- Reducing SAI metadata temporarily
If you want, I can help you:
- Draft the JIRA ticket
- Write a minimal reproducible test case
- Produce a clean patch proposal
- Review the relevant Cassandra code paths with you
Just tell me how deep you want to go.