Ticket created and assigned, thanks!

https://issues.apache.org/jira/browse/CASSANDRA-21132

On Wed, Jan 21, 2026 at 8:15 PM Henry Pan (HP) <[email protected]> wrote:

> Good luck
>
> Thanks & Best Regards
>
> Henry PAN
> Sr. Lead Cloud Architect
> (425) 802-3975
> https://www.linkedin.com/in/henrypan1
>
>
>
> On Wed, Jan 21, 2026 at 3:33 PM Ashaman Kingpin <[email protected]>
> wrote:
>
>> Thanks Henry and Caleb. I will create the Jira ticket tomorrow.
>>
>> On Jan 21, 2026, at 5:34 PM, Caleb Rackliffe <[email protected]>
>> wrote:
>>
>> One other thing that might be interesting to clarify is whether this
>> occurs on a rolling restart as well as a complete bring-down/bring-up.
>>
>> On Wed, Jan 21, 2026 at 3:54 PM Caleb Rackliffe <[email protected]>
>> wrote:
>>
>>> Hi Ashaman,
>>>
>>> Would you be willing to create a Jira with a full description of the
>>> problem and assign it to me? (Caleb Rackliffe/maedhroz)
>>>
>>> Thanks!
>>>
>>> On Wed, Jan 21, 2026 at 9:14 AM Henry Pan (HP) <[email protected]>
>>> wrote:
>>>
>>>> You’ve written a *very* solid technical analysis already, and what
>>>> you’re describing *is not expected behavior in Cassandra 5.0.x*. Based
>>>> on the symptoms, stack trace, and the version‑gating logic you inspected,
>>>> what you’re hitting is almost certainly a *real bug* in the 5.0.x SAI
>>>> gossip serialization path — specifically in the *pre‑convergence
>>>> version detection logic*.
>>>>
>>>> Let me walk you through what’s actually happening, why your cluster
>>>> gets stuck, and what the Cassandra community would typically recommend.
>>>>
>>>> 🧠 *What’s Actually Going Wrong (Root Cause)*
>>>>
>>>> *1. Cassandra 5.0.3 introduced a fix (CASSANDRA-20058)*
>>>>
>>>> That fix added a *compressed SAI index‑status gossip format* to avoid
>>>> blowing up the gossip payload size.
>>>>
>>>> *2. But the fix is gated on minVersion*
>>>>
>>>> The new format is only used when:
>>>>
>>>>    - Gossip has converged
>>>>    - All nodes report version ≥ 5.0.3
>>>>
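>>>> In code, that gate has roughly the shape of the method you patched
>>>> (a sketch, not the exact source):
>>>>
>>>> // Roughly the 5.0.3 gate: when the cluster-wide minimum version is
>>>> // unknown (null), Cassandra assumes the worst and writes the legacy
>>>> // uncompressed format.
>>>> private static boolean shouldWriteLegacyStatusFormat(CassandraVersion minVersion)
>>>> {
>>>>     return minVersion == null
>>>>            || (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
>>>> }
>>>>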
>>>> *3. During startup, minVersion is null*
>>>>
>>>> This is the key problem.
>>>>
>>>> On a cold restart:
>>>>
>>>>    - Nodes have *no* gossip state yet
>>>>    - minVersion is treated as *unknown*
>>>>    - Cassandra falls back to the *legacy (pre‑5.0.3) format*
>>>>    - That format serializes *all* SAI index metadata uncompressed
>>>>    - With many keyspaces/tables/indexes, the payload becomes huge
>>>>    - TypeSizes.sizeof() asserts because the encoded value exceeds
>>>>    what its length prefix can represent (see the sketch below)
>>>>
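>>>> For intuition, here is a minimal, illustrative reconstruction of the
>>>> failing check (not the actual Cassandra source): gossip values are
>>>> written with a writeUTF-style 2-byte length prefix, so an encoded
>>>> value longer than 65535 bytes cannot be represented and the size
>>>> calculation asserts.
>>>>
>>>> // Illustrative sketch only; not the real TypeSizes implementation.
>>>> static int sizeofUtf(String value)
>>>> {
>>>>     int length = value.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
>>>>     // A writeUTF-style length prefix can only encode up to 0xFFFF bytes,
>>>>     // so an oversized SAI index-status string trips this assert.
>>>>     assert length <= 0xFFFF;
>>>>     return 2 + length; // 2-byte length prefix + encoded bytes
>>>> }
>>>>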
>>>> *4. The node never joins gossip → gossip never converges → new format
>>>> never activates*
>>>>
>>>> This creates a *deadlock*:
>>>>
>>>> Needs gossip convergence → to use compact format
>>>> Needs compact format → to join gossip
>>>>
>>>> This is exactly why only the *first* node starts — it has no peers to
>>>> gossip with, so it never receives a large payload.
>>>>
>>>> 🟠 *Is this expected behavior in 5.0.x?*
>>>>
>>>> *No.*
>>>> This is not expected, and your analysis is correct: the version‑gating
>>>> logic is flawed for clusters with large SAI metadata sets.
>>>>
>>>> The fix in 5.0.3 solved the *symptom* (payload size) but not the *bootstrap
>>>> ordering problem*.
>>>>
>>>> 🔴 *Is this a known limitation or bug?*
>>>>
>>>> It is *not widely documented*, but it *is* a bug.
>>>>
>>>> You are essentially hitting a *gossip‑serialization deadlock* caused
>>>> by:
>>>>
>>>>    - Large SAI metadata
>>>>    - Legacy serialization path
>>>>    - Version detection not initialized early enough
>>>>
>>>> Your observation that the fix is “incomplete” is accurate — the gating
>>>> logic should not depend on gossip convergence during bootstrap.
>>>>
>>>> This should be escalated upstream.
>>>>
>>>> 🟢 *Is your workaround valid?*
>>>>
>>>> Yes — forcing the compact format:
>>>>
>>>> return false;
>>>>
>>>> …is the correct behavior for any *homogeneous 5.0.4+ cluster*.
>>>>
>>>> Your patch proves the issue is not data corruption, but *incorrect
>>>> version‑gating logic*.
>>>>
>>>> 🧭 *Recommended Workarounds (Until Upstream Fix)*
>>>>
>>>> *1. Force compact format during startup (your patch)*
>>>>
>>>> This is the cleanest workaround for now.
>>>>
>>>> *2. Start nodes one at a time with gossip disabled*
>>>>
>>>> Not ideal, but possible:
>>>>
>>>> JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
>>>>
>>>> Then:
>>>>
>>>>    1. Start node
>>>>    2. Wait for it to settle
>>>>    3. Enable join (e.g., nodetool join)
>>>>    4. Repeat
>>>>
>>>> This avoids large gossip payloads during initial handshake.
>>>>
>>>> *3. Reduce SAI index metadata temporarily*
>>>>
>>>> If possible:
>>>>
>>>>    - Drop unused SAI indexes
>>>>    - Reduce index count per table
>>>>    - Restart cluster
>>>>    - Recreate indexes
>>>>
>>>> Not ideal, but works in emergency situations.
>>>>
>>>> *4. Patch the version‑gating logic upstream*
>>>>
>>>> The correct fix would be:
>>>>
>>>>    - If all nodes are ≥ 5.0.3 *by configuration*, use compact format
>>>>    immediately
>>>>    - Do not rely on gossip convergence to activate the new format
>>>>
>>>> This is exactly what your patch does.
>>>>
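>>>> As a sketch of what that could look like (the system property name
>>>> here is hypothetical, purely for illustration), an operator-asserted
>>>> override would let a homogeneous 5.0.3+ cluster skip the legacy format
>>>> before gossip converges:
>>>>
>>>> // Hypothetical sketch; the property name is made up for illustration.
>>>> private static boolean shouldWriteLegacyStatusFormat(CassandraVersion minVersion)
>>>> {
>>>>     // Operator asserts every node is already >= 5.0.3, e.g. via
>>>>     // -Dcassandra.sai.force_compact_index_status=true in the JVM options.
>>>>     if (Boolean.getBoolean("cassandra.sai.force_compact_index_status"))
>>>>         return false;
>>>>
>>>>     return minVersion == null
>>>>            || (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
>>>> }
>>>>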
>>>> 🧩 *Why this only happens on restart (not initial provisioning)*
>>>>
>>>> During initial provisioning:
>>>>
>>>>    - Nodes join one by one
>>>>    - Gossip state is small
>>>>    - SAI metadata is not yet fully populated
>>>>    - No large payloads are exchanged
>>>>
>>>> After provisioning:
>>>>
>>>>    - SAI metadata grows
>>>>    - Gossip payloads grow
>>>>    - Legacy format becomes too large
>>>>    - Restart triggers the deadlock
>>>>
>>>> This is consistent with your observations.
>>>>
>>>> 🟦 *What you should do next*
>>>>
>>>> If you want this fixed upstream, you should:
>>>>
>>>>    - Open a JIRA ticket referencing CASSANDRA-20058
>>>>    - Include your analysis + stack trace
>>>>    - Include your patch
>>>>    - Describe the bootstrap deadlock scenario
>>>>
>>>> This is the kind of issue the Cassandra dev team will want to fix
>>>> before 5.0.x becomes widely adopted.
>>>>
>>>> 🎯 *Short Answers to Your Questions*
>>>>
>>>> *Is this startup/version‑gating behavior expected?*
>>>>
>>>> No — this is unintended behavior.
>>>>
>>>> *Is this a known limitation or bug?*
>>>>
>>>> It is a bug. The fix in 5.0.3 is incomplete for large SAI metadata
>>>> clusters.
>>>>
>>>> *Is there a recommended way to bootstrap or restart clusters in this
>>>> state?*
>>>>
>>>> Workarounds include:
>>>>
>>>>    - Forcing compact format (best)
>>>>    - Staggered startup with join_ring=false
>>>>    - Reducing SAI metadata temporarily
>>>>
>>>> If you want, I can help you:
>>>>
>>>>    - Draft the JIRA ticket
>>>>    - Write a minimal reproducible test case
>>>>    - Produce a clean patch proposal
>>>>    - Review the relevant Cassandra code paths with you
>>>>
>>>> Just tell me how deep you want to go.
>>>>
>>>> Thanks & Best Regards
>>>>
>>>> Henry PAN
>>>> Sr. Lead Cloud Architect
>>>> (425) 802-3975
>>>> https://www.linkedin.com/in/henrypan1
>>>>
>>>>
>>>>
>>>> On Wed, Jan 21, 2026 at 7:07 AM Ashaman Kingpin <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I’m looking for some guidance on a Cassandra 5.0.x startup issue we’re
>>>>> seeing and wanted to ask the user list if this behavior is expected or
>>>>> already known.
>>>>>
>>>>> We’re running a homogeneous 5.0.4 (also tested with 5.0.6) cluster
>>>>> with a relatively large number of keyspaces, tables, and SAI indexes. On
>>>>> initial cluster creation and provisioning of multiple keyspaces, 
>>>>> everything
>>>>> operates as expected. However, after stopping the cluster and restarting
>>>>> all nodes, only the first node comes up successfully. Subsequent nodes 
>>>>> fail
>>>>> during startup with an assertion in the gossip thread while serializing 
>>>>> the
>>>>> SAI index status metadata.
>>>>>
>>>>> ERROR [GossipStage:1] 2025-12-22 17:20:10,365 JVMStabilityInspector.java:70 - Exception in thread Thread[GossipStage:1,5,GossipStage]
>>>>> java.lang.RuntimeException: java.lang.AssertionError
>>>>>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:108)
>>>>>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>>>>>         at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>>>>>         at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>>>>>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>>>>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>>>>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>>         at java.base/java.lang.Thread.run(Thread.java:834)
>>>>> Caused by: java.lang.AssertionError: null
>>>>>         at org.apache.cassandra.db.TypeSizes.sizeof(TypeSizes.java:44)
>>>>>         at org.apache.cassandra.gms.VersionedValue$VersionedValueSerializer.serializedSize(VersionedValue.java:381)
>>>>>         at org.apache.cassandra.gms.VersionedValue$VersionedValueSerializer.serializedSize(VersionedValue.java:359)
>>>>>         at org.apache.cassandra.gms.EndpointStateSerializer.serializedSize(EndpointState.java:344)
>>>>>         at org.apache.cassandra.gms.EndpointStateSerializer.serializedSize(EndpointState.java:300)
>>>>>         at org.apache.cassandra.gms.GossipDigestAckSerializer.serializedSize(GossipDigestAck.java:96)
>>>>>         at org.apache.cassandra.gms.GossipDigestAckSerializer.serializedSize(GossipDigestAck.java:61)
>>>>>         at org.apache.cassandra.net.Message$Serializer.payloadSize(Message.java:1088)
>>>>>         at org.apache.cassandra.net.Message.payloadSize(Message.java:1131)
>>>>>         at org.apache.cassandra.net.Message$Serializer.serializedSize(Message.java:769)
>>>>>
>>>>> It seems there was a fix for this same issue, as reported in this DBA
>>>>> Stack Exchange post
>>>>> <https://dba.stackexchange.com/questions/343389/schema-changes-on-5-0-result-in-gossip-failures-o-a-c-db-db-typesizes-sizeof>
>>>>> (CASSANDRA-20058
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-20058>). It seems
>>>>> to me though that the fix described in that post and ticket, included in
>>>>> Cassandra 5.0.3, is incomplete?  From what I can tell, the fix seems to
>>>>> only be activated once the gossip state of the cluster has converged but
>>>>> the error seems to occur before this happens.  At the point of the error,
>>>>> the minimum cluster version appears to be treated as unknown, which causes
>>>>> Cassandra to fall back to the legacy (pre-5.0.3) index-status 
>>>>> serialization
>>>>> format. In our case, that legacy representation becomes large enough to
>>>>> trigger the assertion, preventing the node from joining. Because the node
>>>>> never joins, gossip never converges, and the newer 5.0.3+ compressed 
>>>>> format
>>>>> is never enabled.
>>>>>
>>>>> This effectively leaves the cluster stuck in a startup loop where only
>>>>> the first node can come up.
>>>>>
>>>>> As a sanity check, I locally modified the version-gating logic in
>>>>> *IndexStatusManager.java* for the index-status serialization to
>>>>> always use the newer compact format during startup, and with that change
>>>>> the cluster started successfully.
>>>>>
>>>>> private static boolean shouldWriteLegacyStatusFormat(CassandraVersion minVersion)
>>>>>     {
>>>>>         return false; // return minVersion == null || (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
>>>>>     }
>>>>>
>>>>> This makes me suspect the issue is related to bootstrap ordering or
>>>>> version detection rather than data corruption or configuration.
>>>>>
>>>>> I posted a more detailed write-up
>>>>> <https://dba.stackexchange.com/questions/349488/cassandra-5-0-4-startup-deadlock-gossip-uses-pre-5-0-3-encoding-due-to-version>
>>>>>  (with
>>>>> stack traces and code references) on DBA StackExchange a few weeks ago but
>>>>> haven’t received any feedback yet, so I wanted to ask here:
>>>>>
>>>>>
>>>>>    - Is this startup/version-gating behavior expected in 5.0.x?
>>>>>    - Is this a known limitation or bug?
>>>>>    - Is there a recommended way to bootstrap or restart clusters in
>>>>>    this state?
>>>>>
>>>>> Any insight would be appreciated. Happy to provide logs or additional
>>>>> details if helpful.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Nicholas
>>>>>
>>>>
