Hi Humphrey,

Thank you for the response.
We want to try this suggestion and see if it will reduce the number of
exceptions. Is there any recommended guideline or best practice for waiting
until rebalancing has fully completed?

I also found another related mailing-list thread from 2019:
https://lists.apache.org/thread/9sxj0rovzs0pq4qy6g78t784xr9pv6xp

In that discussion, an event-based approach is suggested, but it is still
not entirely clear to us whether this is the recommended approach.

Thanks,
Felipe

On Mon, Mar 9, 2026 at 8:28 PM, Humphrey Lopez <[email protected]> wrote:

> I think when a node leaves the cluster, rebalancing happens. And when a
> new node joins (the new pod that started), rebalancing will also happen.
> I think you need to wait for the rebalancing to finish before stopping
> the next node.
>
> Humphrey
>
> On 6 Mar 2026, at 21:08, Felipe Kersting <[email protected]> wrote:
>
> Hi Jeremy,
>
> Thanks for the reply!
>
> We have full control over the rolling upgrade process. We roll only one
> pod at a time. A pod is only allowed to shut down after it has
> successfully left the Ignite grid. Likewise, a new pod is only marked as
> ready, allowing the rollout to proceed, once it has successfully joined
> the grid.
>
> During the bootstrap of a new pod, we simply call `Ignition.start(cfg)`
> and wait for it to complete. The rollout only continues after this call
> finishes successfully.
>
> When the service is started from scratch, we also have additional logic
> to ensure that we only activate the cluster
> (`igniteClient.cluster().state(ClusterState.ACTIVE)`) after all members
> have joined the grid. That said, I believe this is orthogonal to the
> current discussion, since during rolling upgrades the cluster is already
> in the `ACTIVE` state.
>
> During pod shutdown, we rely on `Ignition.stop(cancel=true)`. We invoke
> it synchronously and wait for it to complete before allowing the pod to
> be deleted.
>
> In addition, all of our caches are configured with backups.
> By ensuring that only one pod is deleted at a time, we try to guarantee
> that there is always a backup available to take over as the new primary.
> This seems to work in general, as we can verify that when backups are not
> configured, the rollout consistently results in loss of state.
>
> Please also note that, although we do observe transient
> `ClusterTopologyException` errors during the rollout, we do not actually
> lose cache data. Once the rollout settles, the data stored in the
> affected caches is always still available.
>
> Even though we do control the full rollout process, we do not explicitly
> wait for the topology to become "settled," as you suggested. Do you have
> any examples or guidance on which Ignite APIs we could use during pod
> startup or shutdown to determine when it is safe to proceed?
>
> Thank you!
> Felipe
>
> On Fri, Mar 6, 2026 at 1:17 PM, Jeremy McMillan <[email protected]>
> wrote:
>
>> A) If there is never any partition loss, then we assume all of the data
>> is intact.
>> B) Topology changes are disruptive. These messages are a warning that
>> you are pushing your cluster's ability to maintain the topology and
>> flirting with partition loss.
>>
>> If you have decided to accept these kinds of warnings, you have left the
>> world where guarantees mean anything. Maybe you should slow down your
>> rolling restart. Try the operator pattern so that Kubernetes isn't
>> taking the next node out of the topology before the topology has settled
>> from the prior step. Maybe implement a thin client that executes a
>> Kubernetes operation while listening for remote Ignite events to confirm
>> the operation has succeeded to perform the rolling restart. Please share
>> your code!
>>
>> On Thu, Mar 5, 2026 at 10:20 AM, Felipe Kersting <[email protected]>
>> wrote:
>>
>>> Hello Ignite devs,
>>>
>>> We are in the process of introducing Apache Ignite into our application
>>> (replacing another technology) and are currently testing our rollout
>>> strategy.
>>>
>>> During a rollout, Ignite server nodes are terminated and new nodes are
>>> started one after another (Kubernetes-style rolling update). As a
>>> result, nodes leave and join the cluster continuously. At the moment we
>>> are testing a pure in-memory deployment (no persistence / no baseline
>>> topology configured).
>>>
>>> While running these tests, we noticed that thick clients commonly hit
>>> `ClusterTopologyException` during the rollout, most often when
>>> interacting with caches (typically wrapped in `CacheException`). We
>>> have also seen other rollout-related issues (including the deadlock
>>> previously discussed in this thread), but this email focuses
>>> specifically on `ClusterTopologyException`.
>>>
>>> The documentation suggests that callers should "wait on the future and
>>> use retry logic":
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions
>>>
>>> In our case, the future embedded in the exception is frequently `null`,
>>> so we implemented a retry layer that retries cache operations with
>>> backoff whenever `ClusterTopologyException` is thrown. This seems to
>>> keep the client stable during rollouts, though at the cost of extra
>>> latency.
>>>
>>> Our question is about correctness / idempotency: is it safe to blindly
>>> retry cache operations when `ClusterTopologyException` occurs?
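[A retry layer like the one described above could be sketched as follows. This is plain Java, not an Ignite API: the `isRetriable` predicate stands in for unwrapping `CacheException` / checking for `ClusterTopologyException`, so the sketch stays self-contained.]

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Sketch of a retry helper with exponential backoff. In real code the
// predicate would check for ClusterTopologyException (possibly nested
// inside a CacheException, via getCause()); here it is generic so the
// example needs no Ignite dependency.
public final class TopologyRetry {
    public static <T> T withRetry(Callable<T> op,
                                  Predicate<Exception> isRetriable,
                                  int maxAttempts,
                                  long baseBackoffMs) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!isRetriable.test(e))
                    throw e; // not a topology error: fail fast
                last = e;
                if (attempt < maxAttempts) {
                    // Exponential backoff: base * 2^(attempt - 1).
                    Thread.sleep(baseBackoffMs << (attempt - 1));
                }
            }
        }
        throw last; // retries exhausted
    }
}
```

[A call site would then wrap each cache operation, e.g. `withRetry(() -> { cache.putAll(batch); return null; }, ...)`. Note that this only addresses availability; whether a retried `putAll` or `invoke` is semantically safe is exactly the question posed above.]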
>>>
>>> In particular, we are concerned about the following operations:
>>>
>>> * `IgniteCache::putAll`
>>> * `IgniteCache::clear`
>>> * `IgniteCache::removeAll`
>>> * `IgniteCache::forEach`
>>> * `IgniteCache::invoke`
>>> * `IgniteCache::invokeAll`
>>>
>>> For example:
>>>
>>> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`,
>>> is it guaranteed that the operation was not executed for any key, or
>>> can it be partially executed for a subset of keys?
>>> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
>>> `EntryProcessor` was not executed at all, or could it have been
>>> executed (fully or partially) before the exception was surfaced to the
>>> client?
>>>
>>> If partial execution is possible, then a blind retry could result in
>>> duplicate effects for an arbitrary subset of keys, which could be
>>> problematic depending on the operation semantics.
>>>
>>> Any guidance on the expected guarantees here (or best practices for
>>> designing a safe retry strategy in this scenario) would be greatly
>>> appreciated.
>>>
>>> Thank you,
>>> Felipe
>>
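[On the `invoke` / `invokeAll` concern raised above: one common way to make a retried entry update safe regardless of whether the first attempt partially executed is to make the update itself idempotent, e.g. by storing the ID of the last applied operation inside the value and skipping re-application. The following is a minimal illustration in plain Java — a `HashMap` stands in for the cache and a static method for the `EntryProcessor`; none of this is Ignite API.]

```java
import java.util.Map;
import java.util.UUID;

// Illustration of an idempotent update: the value remembers the ID of the
// last operation applied to it, so replaying the same operation (e.g. after
// a blind retry) is a no-op. In Ignite, this logic would live inside an
// EntryProcessor, which executes atomically per entry.
public final class IdempotentUpdate {
    public static final class Counter {
        public final long value;
        public final UUID lastOpId; // ID of the last applied operation

        public Counter(long value, UUID lastOpId) {
            this.value = value;
            this.lastOpId = lastOpId;
        }
    }

    /** Adds {@code delta} to the entry, unless {@code opId} was already applied. */
    public static void addOnce(Map<String, Counter> cache,
                               String key, long delta, UUID opId) {
        Counter c = cache.getOrDefault(key, new Counter(0, null));
        if (opId.equals(c.lastOpId))
            return; // duplicate delivery: skip
        cache.put(key, new Counter(c.value + delta, opId));
    }
}
```

[With this shape, retrying the whole batch with the same operation IDs is safe even if the first attempt executed for an unknown subset of keys. The caveat: this only deduplicates the immediately preceding operation per key; a production version would need a dedupe window sized to the retry policy.]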

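[Returning to the question at the top of the thread — how to wait until rebalancing has fully completed — the event-based idea from the linked 2019 thread can be sketched as a latch that is released once every cache of interest has reported a rebalance-stopped event. The Ignite wiring is indicated only in comments (the listener registration and the `EVT_CACHE_REBALANCE_STOPPED` event type should be verified against the Ignite events documentation), so the sketch itself stays self-contained plain Java.]

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: block until every cache we care about has reported that its
// rebalance finished. In real code, onRebalanceStopped() would be driven by
// an Ignite local event listener, roughly (pseudocode, verify against docs):
//
//   ignite.events().localListen(evt -> {
//       onRebalanceStopped(((CacheRebalancingEvent) evt).cacheName());
//       return true; // keep listening
//   }, EventType.EVT_CACHE_REBALANCE_STOPPED);
public final class RebalanceAwait {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch done = new CountDownLatch(1);

    public RebalanceAwait(Set<String> cacheNames) {
        pending.addAll(cacheNames);
        if (pending.isEmpty())
            done.countDown();
    }

    /** Called from the event listener for each rebalance-stopped event. */
    public void onRebalanceStopped(String cacheName) {
        pending.remove(cacheName);
        if (pending.isEmpty())
            done.countDown();
    }

    /** Blocks until all caches have rebalanced, or the timeout elapses. */
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return done.await(timeout, unit);
    }
}
```

[One caveat: during a rollout, rebalance start/stop events can fire more than once per cache as the topology changes, so a production version would likely pair start and stop events (or add a quiet period) rather than count a single stop event as final.]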