A) If there is never any partition loss, then we assume all of the data is
intact.
B) Topology changes are disruptive. These messages are a warning that you
are pushing the limits of your cluster's ability to maintain the topology
and flirting with partition loss.

If you have decided to accept these kinds of warnings, you have left the
world where guarantees mean anything. Maybe you should slow down your
rolling restart. Try the operator pattern so that Kubernetes isn't taking
the next node out of the topology before the topology has settled from the
prior step. Or maybe implement a thin client that drives the rolling
restart itself: it executes each Kubernetes operation while listening for
remote Ignite events to confirm the step has succeeded before moving on to
the next node. Please share your code!
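For the retry layer Felipe describes, something like the following sketch is the usual shape. The class and constant names here are illustrative, not from Ignite; a real client would catch `CacheException` and inspect its cause for `ClusterTopologyException` rather than catching a generic exception, and would only route operations through this path when it knows they are safe to repeat:

```java
import java.util.concurrent.Callable;

/**
 * Illustrative retry-with-backoff wrapper (names are made up for this
 * sketch). In a real Ignite client you would catch CacheException, check
 * whether its cause is ClusterTopologyException, and rethrow anything
 * else immediately instead of retrying it.
 */
public class RetryingCaller {
    static final int MAX_ATTEMPTS = 5;
    static final long INITIAL_BACKOFF_MS = 100;

    public static <T> T callWithRetry(Callable<T> op) throws Exception {
        long backoffMs = INITIAL_BACKOFF_MS;
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return op.call();
            } catch (Exception e) { // in practice: only topology-related failures
                last = e;
                if (attempt < MAX_ATTEMPTS) {
                    Thread.sleep(backoffMs);
                    backoffMs *= 2; // exponential backoff between attempts
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Note this does nothing to answer the idempotency question below: as Felipe points out, if an operation such as `invoke`/`invokeAll` can be partially executed before the exception surfaces, a blind retry through a wrapper like this can double-apply the `EntryProcessor` for some keys.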

On Thu, Mar 5, 2026 at 10:20 AM Felipe Kersting <[email protected]>
wrote:

> Hello Ignite devs,
>
> We are in the process of introducing Apache Ignite into our application
> (replacing another technology) and are currently testing our rollout
> strategy.
>
> During a rollout, Ignite server nodes are terminated and new nodes are
> started one after another (Kubernetes-style rolling update). As a result,
> nodes leave and join the cluster continuously. At the moment we are testing
> a pure in-memory deployment (no persistence / no baseline topology
> configured).
>
> While running these tests, we noticed that thick clients commonly hit
> `ClusterTopologyException` during the rollout—most often when interacting
> with caches (typically wrapped in `CacheException`). We have also seen
> other rollout-related issues (including the deadlock previously discussed
> in this thread), but this email focuses specifically on
> `ClusterTopologyException`.
>
> The documentation suggests that callers should "wait on the future and use
> retry logic":
> [
> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions](https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions)
>
> In our case, the future embedded in the exception is frequently `null`, so
> we implemented a retry layer that retries cache operations with backoff
> whenever `ClusterTopologyException` is thrown. This seems to keep the
> client stable during rollouts, though at the cost of extra latency.
>
> Our question is about correctness / idempotency: is it safe to blindly
> retry cache operations when `ClusterTopologyException` occurs?
>
> In particular, we are concerned about the following operations:
>
> * `IgniteCache::putAll`
> * `IgniteCache::clear`
> * `IgniteCache::removeAll`
> * `IgniteCache::forEach`
> * `IgniteCache::invoke`
> * `IgniteCache::invokeAll`
>
> For example:
>
> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`, is
> it guaranteed that the operation was not executed for any key, or can it be
> partially executed for a subset of keys?
> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
> `EntryProcessor` was not executed at all, or could it have been executed
> (fully or partially) before the exception was surfaced to the client?
>
> If partial execution is possible, then a blind retry could result in
> duplicate effects for an arbitrary subset of keys, which could be
> problematic depending on the operation semantics.
>
> Any guidance on the expected guarantees here (or best practices for
> designing a safe retry strategy in this scenario) would be greatly
> appreciated.
>
> Thank you,
> Felipe
>
