Hello Ignite devs,

We are in the process of introducing Apache Ignite into our application (replacing another technology) and are currently testing our rollout strategy.
During a rollout, Ignite server nodes are terminated and new nodes are started one after another (Kubernetes-style rolling update). As a result, nodes leave and join the cluster continuously. At the moment we are testing a pure in-memory deployment (no persistence / no baseline topology configured).

While running these tests, we noticed that thick clients commonly hit `ClusterTopologyException` during the rollout, most often when interacting with caches (typically wrapped in `CacheException`). We have also seen other rollout-related issues (including the deadlock previously discussed in this thread), but this email focuses specifically on `ClusterTopologyException`.

The documentation suggests that callers should "wait on the future and use retry logic": [https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions](https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions)

In our case, the future embedded in the exception is frequently `null`, so we implemented a retry layer that retries cache operations with backoff whenever `ClusterTopologyException` is thrown. This seems to keep the client stable during rollouts, though at the cost of extra latency.

Our question is about correctness / idempotency: is it safe to blindly retry cache operations when `ClusterTopologyException` occurs? In particular, we are concerned about the following operations:

* `IgniteCache::putAll`
* `IgniteCache::clear`
* `IgniteCache::removeAll`
* `IgniteCache::forEach`
* `IgniteCache::invoke`
* `IgniteCache::invokeAll`

For example:

* If `ClusterTopologyException` is thrown from `IgniteCache::forEach`, is it guaranteed that the operation was not executed for any key, or can it be partially executed for a subset of keys?
* Likewise for `invoke` / `invokeAll`: is it guaranteed that the `EntryProcessor` was not executed at all, or could it have been executed (fully or partially) before the exception was surfaced to the client?

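
To make the concern concrete, here is a minimal sketch of the kind of blind retry layer described above. It is illustrative only, not the actual code: `BlindRetry`, `call`, `causedBy`, and the attempt/backoff parameters are hypothetical names, and the `retryable` predicate stands in for a check such as `t instanceof ClusterTopologyException`, so the sketch compiles with the JDK alone.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper: blindly re-runs an operation with exponential backoff. */
final class BlindRetry {
    private BlindRetry() {}

    /** Returns true if any throwable in the cause chain of t matches the predicate. */
    static boolean causedBy(Throwable t, Predicate<Throwable> match) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (match.test(c)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs op, retrying up to maxAttempts times in total when the failure
     * (or one of its causes) is considered retryable. The backoff doubles
     * after every failed attempt. The whole operation is re-executed each
     * time; nothing here knows whether an earlier attempt partially applied.
     */
    static <T> T call(Callable<T> op, Predicate<Throwable> retryable,
                      int maxAttempts, long initialBackoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a non-retryable failure.
                if (attempt >= maxAttempts || !causedBy(e, retryable)) {
                    throw e;
                }
                Thread.sleep(backoffMs);
                backoffMs *= 2;
            }
        }
    }
}
```

With Ignite on the classpath, a call might look like `BlindRetry.call(() -> { cache.putAll(batch); return null; }, t -> t instanceof ClusterTopologyException, 5, 100L)`, where `cache` and `batch` are placeholders. The cause-chain check is there because the exception usually arrives wrapped in `CacheException`.
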
If partial execution is possible, then a blind retry could result in duplicate effects for an arbitrary subset of keys, which could be problematic depending on the operation semantics.

Any guidance on the expected guarantees here (or best practices for designing a safe retry strategy in this scenario) would be greatly appreciated.

Thank you,
Felipe
