Sorry, my morning coffee isn't working yet. You aren't making permanent
topology changes, so you can ignore the rebalancing status messages.

For rolling restarts, follow the discovery events:
https://www.gridgain.com/docs/gridgain8/latest/developers-guide/events/events#discovery-events
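One way to gate each rollout step on those events is to wait until the topology has been quiet for a while. The sketch below is pure JDK and deliberately abstract: the quiet-period heuristic and all parameter names are my own, not an Ignite API. In a real node you could back the supplier with `ignite.cluster().topologyVersion()`, or bump a counter from a local listener registered for `EVT_NODE_JOINED` / `EVT_NODE_LEFT` / `EVT_NODE_FAILED`.

```java
import java.util.function.LongSupplier;

/**
 * Sketch: block a rollout step until an observed topology version has been
 * stable for a quiet period. Pure-JDK stand-in; in an Ignite deployment the
 * supplier would be backed by ignite.cluster().topologyVersion() or a counter
 * driven by discovery events (this wiring is assumed, not shown).
 */
public class TopologySettle {
    /** Returns true once observed stays unchanged for quietMs, false if
     *  timeoutMs elapses first (or the thread is interrupted). */
    public static boolean awaitStable(LongSupplier observed, long quietMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        long last = observed.getAsLong();
        long quietSince = System.nanoTime();
        while (System.nanoTime() < deadline) {
            long cur = observed.getAsLong();
            if (cur != last) {                     // topology changed: restart quiet period
                last = cur;
                quietSince = System.nanoTime();
            } else if (System.nanoTime() - quietSince >= quietMs * 1_000_000L) {
                return true;                       // quiet long enough: settled
            }
            try {
                Thread.sleep(5);                   // poll interval
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;                              // still churning at timeout
    }
}
```

The readiness probe of a new pod (or the pre-stop hook of an old one) could call this before letting the rollout proceed.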

On Tue, Mar 10, 2026, 08:31 Jeremy McMillan <[email protected]> wrote:

> Look for the code generating the rebalancing status messages in the logs.
> Once you become familiar with that, the solution will be clear.
>
> On Fri, Mar 6, 2026, 14:08 Felipe Kersting <[email protected]>
> wrote:
>
>> Hi Jeremy,
>>
>> Thanks for the reply!
>>
>> We have full control over the rolling upgrade process. We roll only one
>> pod at a time. A pod is only allowed to shut down after it has successfully
>> left the Ignite grid. Likewise, a new pod is only marked as ready, allowing
>> the rollout to proceed, once it has successfully joined the grid.
>>
>> During the bootstrap of a new pod, we simply call `Ignition.start(cfg)`
>> and wait for it to complete. The rollout only continues after this call
>> finishes successfully.
>>
>> When the service is started from scratch, we also have additional logic
>> to ensure that we only activate the cluster
>> (`igniteClient.cluster().state(ClusterState.ACTIVE)`) after all members
>> have joined the grid. That said, I believe this is orthogonal to the
>> current discussion, since during rolling upgrades the cluster is already in
>> the `ACTIVE` state.
>>
>> During pod shutdown, we rely on `Ignition.stop(cancel=true)`. We invoke
>> it synchronously and wait for it to complete before allowing the pod to be
>> deleted.
>>
>> In addition, all of our caches are configured with backups. By ensuring
>> that only one pod is deleted at a time, we try to guarantee that there is
>> always a backup available to take over as the new primary. This seems to
>> work in general, as we can verify that when backups are not configured, the
>> rollout consistently results in loss of state.
>>
>> Please also note that, although we do observe transient
>> ClusterTopologyException errors during the rollout, we do not actually lose
>> cache data. Once the rollout settles, the data stored in the affected
>> caches is always still available.
>>
>> Even though we do control the full rollout process, we do not explicitly
>> wait for the topology to become "settled," as you suggested. Do you have
>> any examples or guidance on which Ignite APIs we could use during pod
>> startup or shutdown to determine when it is safe to proceed?
>>
>> Thank you!
>> Felipe
>>
>> On Fri, Mar 6, 2026 at 1:17 PM, Jeremy McMillan <[email protected]>
>> wrote:
>>
>>> A) If there is never any partition loss, then we assume all of the data
>>> is intact.
>>> B) Topology changes are disruptive. These messages are a warning that
>>> you are pushing the limits of your cluster's ability to maintain its
>>> topology, and you are flirting with partition loss.
>>>
>>> If you have decided to accept these kinds of warnings, you have left the
>>> world where guarantees mean anything. Maybe you should slow down your
>>> rolling restart. Try the operator pattern, so that Kubernetes isn't taking
>>> the next node out of the topology before the topology has settled from the
>>> prior step. Or implement a thin client that drives the rolling restart:
>>> it performs each Kubernetes operation while listening for remote Ignite
>>> events to confirm that the previous step has completed. Please share your
>>> code!
>>>
>>> On Thu, Mar 5, 2026 at 10:20 AM Felipe Kersting <
>>> [email protected]> wrote:
>>>
>>>> Hello Ignite devs,
>>>>
>>>> We are in the process of introducing Apache Ignite into our application
>>>> (replacing another technology) and are currently testing our rollout
>>>> strategy.
>>>>
>>>> During a rollout, Ignite server nodes are terminated and new nodes are
>>>> started one after another (Kubernetes-style rolling update). As a result,
>>>> nodes leave and join the cluster continuously. At the moment we are testing
>>>> a pure in-memory deployment (no persistence / no baseline topology
>>>> configured).
>>>>
>>>> While running these tests, we noticed that thick clients commonly hit
>>>> `ClusterTopologyException` during the rollout—most often when interacting
>>>> with caches (typically wrapped in `CacheException`). We have also seen
>>>> other rollout-related issues (including the deadlock previously discussed
>>>> in this thread), but this email focuses specifically on
>>>> `ClusterTopologyException`.
>>>>
>>>> The documentation suggests that callers should "wait on the future and
>>>> use retry logic":
>>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions
>>>>
>>>> In our case, the future embedded in the exception is frequently `null`,
>>>> so we implemented a retry layer that retries cache operations with backoff
>>>> whenever `ClusterTopologyException` is thrown. This seems to keep the
>>>> client stable during rollouts, though at the cost of extra latency.
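A retry layer like the one described above can be sketched as follows. Everything here is an illustrative stand-in rather than the actual implementation: the exception class substitutes for Ignite's `ClusterTopologyException` (which the client would typically see wrapped in a `CacheException`), and the attempt and backoff parameters are arbitrary.

```java
import java.util.function.Supplier;

/** Sketch of a retry-with-exponential-backoff wrapper for cache operations. */
public class RetryingCacheOps {
    /** Stand-in for Ignite's ClusterTopologyException (assumed, for the sketch). */
    public static class TopologyRetryException extends RuntimeException {}

    /** Runs op, retrying on TopologyRetryException with exponential backoff. */
    public static <T> T withRetries(Supplier<T> op, int maxAttempts, long baseBackoffMs) {
        long backoff = baseBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (TopologyRetryException e) {
                if (attempt >= maxAttempts)
                    throw e;                       // retries exhausted: surface failure
                try {
                    Thread.sleep(backoff);         // back off before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
                backoff *= 2;                      // double the wait each round
            }
        }
    }
}
```

Whether such blind retries are *safe* is exactly the question raised below: the wrapper assumes the wrapped operation is idempotent.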
>>>>
>>>> Our question is about correctness / idempotency: is it safe to blindly
>>>> retry cache operations when `ClusterTopologyException` occurs?
>>>>
>>>> In particular, we are concerned about the following operations:
>>>>
>>>> * `IgniteCache::putAll`
>>>> * `IgniteCache::clear`
>>>> * `IgniteCache::removeAll`
>>>> * `IgniteCache::forEach`
>>>> * `IgniteCache::invoke`
>>>> * `IgniteCache::invokeAll`
>>>>
>>>> For example:
>>>>
>>>> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`,
>>>> is it guaranteed that the operation was not executed for any key, or can it
>>>> be partially executed for a subset of keys?
>>>> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
>>>> `EntryProcessor` was not executed at all, or could it have been executed
>>>> (fully or partially) before the exception was surfaced to the client?
>>>>
>>>> If partial execution is possible, then a blind retry could result in
>>>> duplicate effects for an arbitrary subset of keys, which could be
>>>> problematic depending on the operation semantics.
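One defensive pattern that sidesteps the question of partial execution is to make the operation itself idempotent, e.g. by tagging each logical update with an operation ID and storing the last applied ID next to the value. The sketch below is pure JDK with hypothetical names, not Ignite code; in Ignite the same idea could live inside an `EntryProcessor` that keeps the last-applied ID in the entry. It also assumes at most one in-flight operation per key, since only the most recent ID is remembered.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: an idempotent increment keyed by operation ID, so a blind retry
 *  after a partial failure cannot apply the same update twice. */
public class IdempotentUpdate {
    /** Per-key state: current value plus the ID of the last applied operation. */
    public static class Entry {
        public long value;
        public String lastOpId;
    }

    /** Applies the delta only if this opId has not already been applied to key. */
    public static void increment(Map<String, Entry> store, String key,
                                 String opId, long delta) {
        Entry e = store.computeIfAbsent(key, k -> new Entry());
        if (opId.equals(e.lastOpId))
            return;                 // duplicate delivery from a retry: ignore
        e.value += delta;
        e.lastOpId = opId;          // remember the applied operation
    }
}
```

With this shape, retrying after a `ClusterTopologyException` is harmless regardless of whether the first attempt ran for none, some, or all of the keys.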
>>>>
>>>> Any guidance on the expected guarantees here (or best practices for
>>>> designing a safe retry strategy in this scenario) would be greatly
>>>> appreciated.
>>>>
>>>> Thank you,
>>>> Felipe
>>>>
>>>
