Good luck

Thanks & Best Regards

Henry PAN
Sr. Lead Cloud Architect
(425) 802-3975
https://www.linkedin.com/in/henrypan1



On Wed, Jan 21, 2026 at 3:33 PM Ashaman Kingpin <[email protected]>
wrote:

> Thanks Henry and Caleb. I will create the Jira ticket tomorrow.
>
> On Jan 21, 2026, at 5:34 PM, Caleb Rackliffe <[email protected]>
> wrote:
>
>
> One other thing that might be interesting to clarify is whether this
> occurs on a rolling restart as well as a complete bring-down/bring-up.
>
> On Wed, Jan 21, 2026 at 3:54 PM Caleb Rackliffe <[email protected]>
> wrote:
>
>> Hi Ashaman,
>>
>> Would you be willing to create a Jira with a full description of the
>> problem and assign it to me? (Caleb Rackliffe/maedhroz)
>>
>> Thanks!
>>
>> On Wed, Jan 21, 2026 at 9:14 AM Henry Pan (HP) <[email protected]>
>> wrote:
>>
>>> You’ve written a *very* solid technical analysis already, and what
>>> you’re describing *is not expected behavior in Cassandra 5.0.x*. Based
>>> on the symptoms, stack trace, and the version‑gating logic you inspected,
>>> what you’re hitting is almost certainly a *real bug* in the 5.0.x SAI
>>> gossip serialization path — specifically in the *pre‑convergence
>>> version detection logic*.
>>>
>>> Let me walk you through what’s actually happening, why your cluster gets
>>> stuck, and what the Cassandra community would typically recommend.
>>>
>>> 🧠 *What’s Actually Going Wrong (Root Cause)*
>>>
>>> *1. Cassandra 5.0.3 introduced a fix (CASSANDRA‑20058)*
>>>
>>> That fix added a *compressed SAI index‑status gossip format* to avoid
>>> blowing up the gossip payload size.
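>>>
>>> To make the size problem concrete, here is a small self-contained
>>> illustration (not the actual CASSANDRA-20058 encoding; the schema shape,
>>> names, and status strings below are made up) of how an uncompressed
>>> per-index status listing outgrows a gossip-friendly payload while a
>>> compressed form stays small:
>>>
>>> import java.nio.charset.StandardCharsets;
>>> import java.util.zip.Deflater;
>>>
>>> public class PayloadSizeSketch
>>> {
>>>     public static void main(String[] args)
>>>     {
>>>         // Emulate a large schema: 50 keyspaces x 20 tables x 3 SAI indexes.
>>>         StringBuilder status = new StringBuilder();
>>>         for (int ks = 0; ks < 50; ks++)
>>>             for (int t = 0; t < 20; t++)
>>>                 for (int i = 0; i < 3; i++)
>>>                     status.append("ks").append(ks).append(".tbl").append(t)
>>>                           .append(".idx").append(i).append(":BUILD_SUCCEEDED,");
>>>
>>>         byte[] raw = status.toString().getBytes(StandardCharsets.UTF_8);
>>>
>>>         // Deflate the same payload as a stand-in for the compact format.
>>>         Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
>>>         deflater.setInput(raw);
>>>         deflater.finish();
>>>         byte[] buf = new byte[raw.length];
>>>         int compressed = deflater.deflate(buf);
>>>         deflater.end();
>>>
>>>         // The raw form lands well past 64 KiB; the deflated form does not.
>>>         System.out.printf("uncompressed: %d bytes, compressed: %d bytes%n",
>>>                           raw.length, compressed);
>>>     }
>>> }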
>>>
>>> *2. But the fix is gated on minVersion*
>>>
>>> The new format is only used when:
>>>
>>>    - Gossip has converged
>>>    - All nodes report version ≥ 5.0.3
>>>
>>> *3. During startup, minVersion is null*
>>>
>>> This is the key problem.
>>>
>>> On a cold restart:
>>>
>>>    - Nodes have *no* gossip state yet
>>>    - minVersion is treated as *unknown*
>>>    - Cassandra falls back to the *legacy (pre‑5.0.3) format*
>>>    - That format serializes *all* SAI index metadata uncompressed
>>>    - With many keyspaces/tables/indexes, the payload becomes huge
>>>    - TypeSizes.sizeof() asserts because the payload exceeds what the
>>>    legacy encoding can represent (see the demo below)
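>>>
>>> If I'm reading TypeSizes.sizeof(String) correctly, the legacy path sizes
>>> the status value with a writeUTF-style 2-byte length prefix, which caps a
>>> single gossip value at 65,535 encoded bytes; the assertion guards that
>>> cap. Here is a standalone demo of the same limit in plain JDK code (not
>>> Cassandra code):
>>>
>>> import java.io.ByteArrayOutputStream;
>>> import java.io.DataOutputStream;
>>> import java.io.UTFDataFormatException;
>>>
>>> public class WriteUtfLimitDemo
>>> {
>>>     public static void main(String[] args) throws Exception
>>>     {
>>>         // One byte past the 65,535-byte cap of writeUTF's 2-byte length prefix.
>>>         String big = "x".repeat(65_536);
>>>         try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream()))
>>>         {
>>>             out.writeUTF(big);
>>>         }
>>>         catch (UTFDataFormatException e)
>>>         {
>>>             // The same 64 KiB wall that (if my reading is right) surfaces
>>>             // in Cassandra as the AssertionError in TypeSizes.sizeof().
>>>             System.out.println("over the writeUTF limit: " + e.getMessage());
>>>         }
>>>     }
>>> }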
>>>
>>> *4. The node never joins gossip → gossip never converges → new format
>>> never activates*
>>>
>>> This creates a *deadlock*:
>>>
>>> Needs gossip convergence → to use the compact format
>>> Needs the compact format → to join gossip
>>>
>>> This is exactly why only the *first* node starts — it has no peers to
>>> gossip with, so it never receives a large payload.
>>>
>>> 🟠 *Is this expected behavior in 5.0.x?*
>>>
>>> *No.*
>>> This is not expected, and your analysis is correct: the version‑gating
>>> logic is flawed for clusters with large SAI metadata sets.
>>>
>>> The fix in 5.0.3 solved the *symptom* (payload size) but not the *bootstrap
>>> ordering problem*.
>>>
>>> 🔴 *Is this a known limitation or bug?*
>>>
>>> It is *not widely documented*, but it *is* a bug.
>>>
>>> You are essentially hitting a *gossip‑serialization deadlock* caused by:
>>>
>>>    - Large SAI metadata
>>>    - Legacy serialization path
>>>    - Version detection not initialized early enough
>>>
>>> Your observation that the fix is “incomplete” is accurate — the gating
>>> logic should not depend on gossip convergence during bootstrap.
>>>
>>> This should be escalated upstream.
>>>
>>> 🟢 *Is your workaround valid?*
>>>
>>> Yes — forcing the compact format:
>>>
>>> return false; // in shouldWriteLegacyStatusFormat(...)
>>>
>>> …is the correct behavior for any *homogeneous 5.0.4+ cluster*.
>>>
>>> Your patch proves the issue is not data corruption, but *incorrect
>>> version‑gating logic*.
>>>
>>> 🧭 *Recommended Workarounds (Until Upstream Fix)*
>>>
>>> *1. Force compact format during startup (your patch)*
>>>
>>> This is the cleanest workaround for now.
>>>
>>> *2. Start nodes one at a time with gossip disabled*
>>>
>>> Not ideal, but possible:
>>>
>>> JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
>>>
>>> Then:
>>>
>>>    1. Start the node with join disabled
>>>    2. Wait for it to settle
>>>    3. Enable join (e.g., via nodetool join)
>>>    4. Repeat for each node
>>>
>>> This avoids large gossip payloads during initial handshake.
>>>
>>> *3. Reduce SAI index metadata temporarily*
>>>
>>> If possible:
>>>
>>>    - Drop unused SAI indexes
>>>    - Reduce index count per table
>>>    - Restart cluster
>>>    - Recreate indexes
>>>
>>> Not ideal, but works in emergency situations.
>>>
>>> *4. Patch the version‑gating logic upstream*
>>>
>>> The correct fix would be:
>>>
>>>    - If all nodes are ≥ 5.0.3 *by configuration*, use compact format
>>>    immediately
>>>    - Do not rely on gossip convergence to activate the new format
>>>
>>> This is exactly what your patch does.
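>>>
>>> A hedged sketch of what a configuration-driven gate could look like; the
>>> system property and the helper class below are invented for illustration,
>>> not actual Cassandra APIs:
>>>
>>> public final class IndexStatusGateSketch
>>> {
>>>     // Hypothetical operator override, e.g.
>>>     // -Dcassandra.sai.assume_min_version_5_0_3=true (made-up property).
>>>     private static final boolean OPERATOR_ASSERTS_5_0_3 =
>>>         Boolean.getBoolean("cassandra.sai.assume_min_version_5_0_3");
>>>
>>>     static boolean shouldWriteLegacyStatusFormat(CassandraVersion minVersion)
>>>     {
>>>         // An operator guarantee beats the (possibly unknown) gossip state.
>>>         if (OPERATOR_ASSERTS_5_0_3)
>>>             return false;
>>>
>>>         // Existing gate: an unknown minVersion forces the legacy format,
>>>         // which is exactly what deadlocks a cold restart.
>>>         return minVersion == null
>>>                || (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
>>>     }
>>>
>>>     // Minimal stand-in so the sketch compiles on its own; the real class
>>>     // is org.apache.cassandra.utils.CassandraVersion.
>>>     static final class CassandraVersion
>>>     {
>>>         final int major, minor, patch;
>>>         CassandraVersion(int major, int minor, int patch)
>>>         {
>>>             this.major = major;
>>>             this.minor = minor;
>>>             this.patch = patch;
>>>         }
>>>     }
>>> }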
>>>
>>> 🧩 *Why this only happens on restart (not initial provisioning)*
>>>
>>> During initial provisioning:
>>>
>>>    - Nodes join one by one
>>>    - Gossip state is small
>>>    - SAI metadata is not yet fully populated
>>>    - No large payloads are exchanged
>>>
>>> After provisioning:
>>>
>>>    - SAI metadata grows
>>>    - Gossip payloads grow
>>>    - Legacy format becomes too large
>>>    - Restart triggers the deadlock
>>>
>>> This is consistent with your observations.
>>>
>>> 🟦 *What you should do next*
>>>
>>> If you want this fixed upstream, you should:
>>>
>>>    - Open a JIRA ticket referencing CASSANDRA‑20058
>>>    - Include your analysis + stack trace
>>>    - Include your patch
>>>    - Describe the bootstrap deadlock scenario
>>>
>>> This is the kind of issue the Cassandra dev team will want to fix before
>>> 5.0.x becomes widely adopted.
>>>
>>> 🎯 *Short Answers to Your Questions*
>>>
>>> *Is this startup/version‑gating behavior expected?*
>>>
>>> No — this is unintended behavior.
>>>
>>> *Is this a known limitation or bug?*
>>>
>>> It is a bug. The fix in 5.0.3 is incomplete for large SAI metadata
>>> clusters.
>>>
>>> *Is there a recommended way to bootstrap or restart clusters in this
>>> state?*
>>>
>>> Workarounds include:
>>>
>>>    - Forcing compact format (best)
>>>    - Staggered startup with join_ring=false
>>>    - Reducing SAI metadata temporarily
>>>
>>> If you want, I can help you:
>>>
>>>    - Draft the JIRA ticket
>>>    - Write a minimal reproducible test case
>>>    - Produce a clean patch proposal
>>>    - Review the relevant Cassandra code paths with you
>>>
>>> Just tell me how deep you want to go.
>>>
>>> Thanks & Best Regards
>>>
>>> Henry PAN
>>> Sr. Lead Cloud Architect
>>> (425) 802--3975
>>> https://www.linkedin.com/in/henrypan1
>>>
>>>
>>>
>>> On Wed, Jan 21, 2026 at 7:07 AM Ashaman Kingpin <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I’m looking for some guidance on a Cassandra 5.0.x startup issue we’re
>>>> seeing and wanted to ask the user list if this behavior is expected or
>>>> already known.
>>>>
>>>> We’re running a homogeneous 5.0.4 (also tested with 5.0.6) cluster with
>>>> a relatively large number of keyspaces, tables, and SAI indexes. On initial
>>>> cluster creation and provisioning of multiple keyspaces, everything
>>>> operates as expected. However, after stopping the cluster and restarting
>>>> all nodes, only the first node comes up successfully. Subsequent nodes fail
>>>> during startup with an assertion in the gossip thread while serializing the
>>>> SAI index status metadata.
>>>>
>>>> ERROR [GossipStage:1] 2025-12-22 17:20:10,365 JVMStabilityInspector.java:70 - Exception in thread Thread[GossipStage:1,5,GossipStage]
>>>> java.lang.RuntimeException: java.lang.AssertionError
>>>>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:108)
>>>>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>>>>         at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>>>>         at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>>>>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>>>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>>>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>         at java.base/java.lang.Thread.run(Thread.java:834)
>>>> Caused by: java.lang.AssertionError: null
>>>>         at org.apache.cassandra.db.TypeSizes.sizeof(TypeSizes.java:44)
>>>>         at org.apache.cassandra.gms.VersionedValue$VersionedValueSerializer.serializedSize(VersionedValue.java:381)
>>>>         at org.apache.cassandra.gms.VersionedValue$VersionedValueSerializer.serializedSize(VersionedValue.java:359)
>>>>         at org.apache.cassandra.gms.EndpointStateSerializer.serializedSize(EndpointState.java:344)
>>>>         at org.apache.cassandra.gms.EndpointStateSerializer.serializedSize(EndpointState.java:300)
>>>>         at org.apache.cassandra.gms.GossipDigestAckSerializer.serializedSize(GossipDigestAck.java:96)
>>>>         at org.apache.cassandra.gms.GossipDigestAckSerializer.serializedSize(GossipDigestAck.java:61)
>>>>         at org.apache.cassandra.net.Message$Serializer.payloadSize(Message.java:1088)
>>>>         at org.apache.cassandra.net.Message.payloadSize(Message.java:1131)
>>>>         at org.apache.cassandra.net.Message$Serializer.serializedSize(Message.java:769)
>>>>
>>>> It seems there was a fix to this same issue as reported in this DBA
>>>> Stack Exchange post
>>>> <https://dba.stackexchange.com/questions/343389/schema-changes-on-5-0-result-in-gossip-failures-o-a-c-db-db-typesizes-sizeof>
>>>> (CASSANDRA-20058
>>>> <https://issues.apache.org/jira/browse/CASSANDRA-20058>). It seems
>>>> to me though that the fix described in that post and ticket, included in
>>>> Cassandra 5.0.3, is incomplete?  From what I can tell, the fix seems to
>>>> only be activated once the gossip state of the cluster has converged but
>>>> the error seems to occur before this happens.  At the point of the error,
>>>> the minimum cluster version appears to be treated as unknown, which causes
>>>> Cassandra to fall back to the legacy (pre-5.0.3) index-status serialization
>>>> format. In our case, that legacy representation becomes large enough to
>>>> trigger the assertion, preventing the node from joining. Because the node
>>>> never joins, gossip never converges, and the newer 5.0.3+ compressed format
>>>> is never enabled.
>>>>
>>>> This effectively leaves the cluster stuck in a startup loop where only
>>>> the first node can come up.
>>>>
>>>> As a sanity check, I locally modified the version-gating logic in
>>>> *IndexStatusManager.java* for the index-status serialization to always
>>>> use the newer compact format during startup, and with that change the
>>>> cluster started successfully.
>>>>
>>>> private static boolean shouldWriteLegacyStatusFormat(CassandraVersion minVersion)
>>>> {
>>>>     return false;
>>>>     // Original: return minVersion == null
>>>>     //        || (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
>>>> }
>>>>
>>>> This makes me suspect the issue is related to bootstrap ordering or
>>>> version detection rather than data corruption or configuration.
>>>>
>>>> I posted a more detailed write-up
>>>> <https://dba.stackexchange.com/questions/349488/cassandra-5-0-4-startup-deadlock-gossip-uses-pre-5-0-3-encoding-due-to-version>
>>>>  (with
>>>> stack traces and code references) on DBA StackExchange a few weeks ago but
>>>> haven’t received any feedback yet, so I wanted to ask here:
>>>>
>>>>
>>>>    - Is this startup/version-gating behavior expected in 5.0.x?
>>>>    - Is this a known limitation or bug?
>>>>    - Is there a recommended way to bootstrap or restart clusters in this
>>>>    state?
>>>>
>>>> Any insight would be appreciated. Happy to provide logs or additional
>>>> details if helpful.
>>>>
>>>> Thanks,
>>>>
>>>> Nicholas
>>>>
>>>
