Hi Jeremy,

Thanks for the reply!

We have full control over the rolling upgrade process. We roll only one pod
at a time. A pod is only allowed to shut down after it has successfully
left the Ignite grid. Likewise, a new pod is only marked as ready, allowing
the rollout to proceed, once it has successfully joined the grid.

During the bootstrap of a new pod, we simply call `Ignition.start(cfg)` and
wait for it to complete. The rollout only continues after this call
finishes successfully.
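
Since you asked us to share code: in simplified form, the bootstrap looks roughly like the following (class and field names are illustrative, not our actual code; it requires `ignite-core` on the classpath).

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerBootstrap {
    // Polled by the Kubernetes readiness probe.
    private volatile boolean ready = false;

    public void start(IgniteConfiguration cfg) {
        // Blocks until this node has joined the grid, or throws on failure.
        Ignite ignite = Ignition.start(cfg);
        // Only after a successful join do we report ready,
        // which allows the rollout to move on to the next pod.
        ready = true;
    }

    public boolean isReady() {
        return ready;
    }
}
```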

When the service is started from scratch, we also have additional logic to
ensure that we only activate the cluster
(`igniteClient.cluster().state(ClusterState.ACTIVE)`) after all members
have joined the grid. That said, I believe this is orthogonal to the
current discussion, since during rolling upgrades the cluster is already in
the `ACTIVE` state.
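
The activation logic is essentially a poll-until-complete loop. A simplified version is below; the generic helper is extracted so it reads standalone, and the expected server count comes from our deployment spec (names are illustrative).

```java
import java.util.function.IntSupplier;

public class ClusterActivator {
    // Waits until currentCount reports at least `expected`, polling every pollMs.
    static void awaitAtLeast(IntSupplier currentCount, int expected, long pollMs)
            throws InterruptedException {
        while (currentCount.getAsInt() < expected) {
            Thread.sleep(pollMs);
        }
    }

    // With a live node it is used roughly like this (sketch, not compiled here):
    //   awaitAtLeast(() -> ignite.cluster().forServers().nodes().size(),
    //                EXPECTED_SERVERS, 500);
    //   ignite.cluster().state(ClusterState.ACTIVE);
}
```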

During pod shutdown, we rely on `Ignition.stop(true)` (i.e., `cancel =
true`). We invoke it synchronously and wait for it to complete before
allowing the pod to be deleted.
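
The shutdown side is small; roughly the following (simplified, assuming the default grid instance):

```java
import org.apache.ignite.Ignition;

public class ServerShutdown {
    // Invoked from the pod's preStop hook; blocks until the node has left the grid.
    static void shutdown() {
        // cancel = true: cancel in-flight jobs rather than waiting for them to finish.
        Ignition.stop(true);
    }
}
```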

In addition, all of our caches are configured with backups. By ensuring
that only one pod is deleted at a time, we try to guarantee that there is
always a backup available to take over as the new primary. This appears to
work: we verified that when backups are not configured, the same rollout
consistently results in loss of state.
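
For reference, the relevant cache settings look roughly like this (the cache name, key/value types, and the `FULL_SYNC` choice are illustrative, not a verbatim copy of our configuration):

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheConfigs {
    static CacheConfiguration<String, byte[]> exampleCache() {
        CacheConfiguration<String, byte[]> ccfg =
            new CacheConfiguration<>("example-cache");
        ccfg.setCacheMode(CacheMode.PARTITIONED);
        // One backup copy, so losing a single pod never removes the only replica.
        ccfg.setBackups(1);
        // FULL_SYNC waits for backups on each write; the default PRIMARY_SYNC
        // acknowledges after the primary only.
        ccfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
        return ccfg;
    }
}
```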

Please also note that, although we do observe transient
`ClusterTopologyException` errors during the rollout, we do not actually
lose cache data. Once the rollout settles, the data stored in the affected
caches is always still available.

Even though we do control the full rollout process, we do not explicitly
wait for the topology to become "settled," as you suggested. Do you have
any examples or guidance on which Ignite APIs we could use during pod
startup or shutdown to determine when it is safe to proceed?

Thank you!
Felipe

On Fri, Mar 6, 2026 at 1:17 PM Jeremy McMillan <[email protected]>
wrote:

> A) If there is never any partition loss, then we assume all of the data is
> intact.
> B) Topology changes are disruptive. These messages are a warning that you
> are pushing your cluster's ability to maintain the topology and flirting
> with partition loss.
>
> If you have decided to accept these kinds of warnings, you have left the
> world where guarantees mean anything. Maybe you should slow down your
> rolling restart. Try the operator pattern so that Kubernetes isn't taking
> the next node out of the topology before the topology has settled from the
> prior step. Maybe implement a thin client that executes a Kubernetes
> operation while listening for remote Ignite events to confirm the operation
> has succeeded to perform the rolling restart. Please share your code!
>
> On Thu, Mar 5, 2026 at 10:20 AM Felipe Kersting <[email protected]>
> wrote:
>
>> Hello Ignite devs,
>>
>> We are in the process of introducing Apache Ignite into our application
>> (replacing another technology) and are currently testing our rollout
>> strategy.
>>
>> During a rollout, Ignite server nodes are terminated and new nodes are
>> started one after another (Kubernetes-style rolling update). As a result,
>> nodes leave and join the cluster continuously. At the moment we are testing
>> a pure in-memory deployment (no persistence / no baseline topology
>> configured).
>>
>> While running these tests, we noticed that thick clients commonly hit
>> `ClusterTopologyException` during the rollout—most often when interacting
>> with caches (typically wrapped in `CacheException`). We have also seen
>> other rollout-related issues (including the deadlock previously discussed
>> in this thread), but this email focuses specifically on
>> `ClusterTopologyException`.
>>
>> The documentation suggests that callers should "wait on the future and
>> use retry logic":
>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions
>>
>> In our case, the future embedded in the exception is frequently `null`,
>> so we implemented a retry layer that retries cache operations with backoff
>> whenever `ClusterTopologyException` is thrown. This seems to keep the
>> client stable during rollouts, though at the cost of extra latency.
>>
>> Our question is about correctness / idempotency: is it safe to blindly
>> retry cache operations when `ClusterTopologyException` occurs?
>>
>> In particular, we are concerned about the following operations:
>>
>> * `IgniteCache::putAll`
>> * `IgniteCache::clear`
>> * `IgniteCache::removeAll`
>> * `IgniteCache::forEach`
>> * `IgniteCache::invoke`
>> * `IgniteCache::invokeAll`
>>
>> For example:
>>
>> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`, is
>> it guaranteed that the operation was not executed for any key, or can it be
>> partially executed for a subset of keys?
>> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
>> `EntryProcessor` was not executed at all, or could it have been executed
>> (fully or partially) before the exception was surfaced to the client?
>>
>> If partial execution is possible, then a blind retry could result in
>> duplicate effects for an arbitrary subset of keys, which could be
>> problematic depending on the operation semantics.
>>
>> Any guidance on the expected guarantees here (or best practices for
>> designing a safe retry strategy in this scenario) would be greatly
>> appreciated.
>>
>> Thank you,
>> Felipe
>>
>
