Hello Ignite devs,

We are in the process of introducing Apache Ignite into our application
(replacing another technology) and are currently testing our rollout
strategy.

During a rollout, Ignite server nodes are terminated and new nodes are
started one after another (Kubernetes-style rolling update). As a result,
nodes leave and join the cluster continuously. At the moment we are testing
a pure in-memory deployment (no persistence / no baseline topology
configured).

While running these tests, we noticed that thick clients frequently hit
`ClusterTopologyException` during the rollout, most often when interacting
with caches (where it is typically wrapped in a `CacheException`). We have
also seen other rollout-related issues (including the deadlock previously
discussed in this thread), but this email focuses specifically on
`ClusterTopologyException`.

The documentation suggests that callers should "wait on the future and use
retry logic":
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions

In our case, the future embedded in the exception is frequently `null`, so
we implemented a retry layer that retries cache operations with backoff
whenever `ClusterTopologyException` is thrown. This seems to keep the
client stable during rollouts, though at the cost of extra latency.
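
A retry layer of this kind can be sketched roughly as follows. This is a
simplified illustration, not the exact implementation: `TopologyRetry`,
`withRetry`, `hasCause`, and the backoff values are hypothetical names and
numbers, and the exception is matched by simple class name only so that the
snippet compiles without Ignite on the classpath (real code would more
likely use an `instanceof` check against `ClusterTopologyException`).

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper: retries an operation with exponential backoff.
final class TopologyRetry {
    private static final long MAX_BACKOFF_MS = 5_000L; // assumed cap

    private TopologyRetry() {}

    /** True if any throwable in the cause chain has the given simple class name. */
    static boolean hasCause(Throwable t, String simpleName) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs op, retrying while the failure is considered retryable.
     * Rethrows the last failure once maxAttempts is reached.
     */
    static <T> T withRetry(Supplier<T> op, Predicate<Throwable> retryable,
                           int maxAttempts, long initialBackoffMs) {
        long backoff = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt and give up retrying.
                    Thread.currentThread().interrupt();
                    throw e;
                }
                backoff = Math.min(backoff * 2, MAX_BACKOFF_MS);
            }
        }
    }
}
```

A call site would then look something like
`TopologyRetry.withRetry(() -> cache.get(key), t -> TopologyRetry.hasCause(t, "ClusterTopologyException"), 5, 100L)`,
where `cache` and `key` stand in for application objects.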

Our question is about correctness / idempotency: is it safe to blindly
retry cache operations when `ClusterTopologyException` occurs?

In particular, we are concerned about the following operations:

* `IgniteCache::putAll`
* `IgniteCache::clear`
* `IgniteCache::removeAll`
* `IgniteCache::forEach`
* `IgniteCache::invoke`
* `IgniteCache::invokeAll`

For example:

* If `ClusterTopologyException` is thrown from `IgniteCache::forEach`, is
it guaranteed that the operation was not executed for any key, or can it be
partially executed for a subset of keys?
* Likewise for `invoke` / `invokeAll`: is it guaranteed that the
`EntryProcessor` was not executed at all, or could it have been executed
(fully or partially) before the exception was surfaced to the client?

If partial execution is possible, then a blind retry could result in
duplicate effects for an arbitrary subset of keys, which could be
problematic depending on the operation semantics.
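
To make the concern concrete, here is a small simulation using a plain
`LinkedHashMap` rather than Ignite. It does not claim that Ignite behaves
this way; it only shows what a blind retry would do if partial execution
were possible. `PartialRetryDemo`, `incrementAll`, and the simulated
failure are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simulation only: a plain map stands in for the cache.
final class PartialRetryDemo {
    private PartialRetryDemo() {}

    /**
     * Increments every counter in iteration order, but fails once
     * failAfter entries have been updated, mimicking an exception
     * surfacing in the middle of a bulk operation.
     */
    static void incrementAll(Map<String, Integer> store, int failAfter) {
        int done = 0;
        for (Map.Entry<String, Integer> e : store.entrySet()) {
            if (done == failAfter) {
                throw new IllegalStateException("simulated topology change");
            }
            e.setValue(e.getValue() + 1);
            done++;
        }
    }

    /** Runs one failed attempt followed by one blind retry. */
    static Map<String, Integer> run() {
        Map<String, Integer> store = new LinkedHashMap<>();
        store.put("a", 0);
        store.put("b", 0);
        store.put("c", 0);
        try {
            incrementAll(store, 2); // updates "a" and "b", then fails
        } catch (IllegalStateException e) {
            incrementAll(store, Integer.MAX_VALUE); // blind retry over all keys
        }
        return store;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=2, b=2, c=1}
    }
}
```

Keys `a` and `b` end up incremented twice while `c` is incremented once,
which is exactly the kind of duplicate effect a non-idempotent
`EntryProcessor` would be exposed to.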

Any guidance on the expected guarantees here (or best practices for
designing a safe retry strategy in this scenario) would be greatly
appreciated.

Thank you,
Felipe
