Hello Ignite users! I have a use case where I run SQL queries against a sharded (partitioned) cache, and I need those queries to always return The Right Answer even if some nodes in the ring are lost. As I have rigorously confirmed, SQL queries only see data that is actually in the cache, not data that exists only in the write-through persistent store and has been lost from the cache. Also, when a node is lost, the persisted data is safe, but that node's share of the cached data IS gone from the cache (unless another node holds an in-cache backup of the affected partitions).
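For context, the cache is configured roughly like this; "trades", Trade, and TradeStore are just placeholders for my real names, and the backup count is illustrative:

    import javax.cache.configuration.FactoryBuilder;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    CacheConfiguration<Long, Trade> ccfg = new CacheConfiguration<>("trades");
    ccfg.setCacheMode(CacheMode.PARTITIONED);        // sharded across the ring
    ccfg.setBackups(2);                              // in-cache copies of each partition
    ccfg.setReadThrough(true);
    ccfg.setWriteThrough(true);                      // write-through persistent store
    ccfg.setCacheStoreFactory(FactoryBuilder.factoryOf(TradeStore.class));
    ccfg.setIndexedTypes(Long.class, Trade.class);   // SQL only sees what's in the cache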
Now, I *could* do this by just setting the cache's number of backups equal to the number of nodes I can stand to lose, and then setting a TopologyValidator on the cache to ensure the ring always has more nodes than that number. If the TopologyValidator ever sees the node count drop below this survivability threshold, I crash the app and let everything get reloaded from the persistent store when the nodes automatically start back up (something like the validator sketch below).

This technique has a lot of false positives: we lose too many nodes, but slowly enough that Ignite is well able to shift the data around and avoid data loss, so we shouldn't have had to crash the app. Therefore, I would rather be a little smarter about this for the sake of uptime.

Ideally, in the TopologyValidator logic, while reads and writes to the cache are blocked, I would be able to: 1.) detect when a lost partition has no viable backup, and 2.) reload the affected data from the persistent store (roughly the recovery sketch below).

The problem I am facing is that I can't find a clean and efficient way of figuring out #1 from the information the TopologyValidator gives you (just the collection of surviving nodes). And even if I could, #2 hangs forever, which makes sense because the cache isn't readable or writable until AFTER the topology has been validated.

Has anyone faced a similar challenge and has some wisdom to share? Am I making this way more complicated than it needs to be?
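For reference, the validator I'm describing today is essentially just a node-count check; a minimal sketch (class name and threshold made up):

    import java.util.Collection;
    import org.apache.ignite.cluster.ClusterNode;
    import org.apache.ignite.configuration.TopologyValidator;

    // Marks the cache topology invalid (blocking reads and writes) whenever
    // the ring shrinks below the survivability threshold; the app then crashes
    // itself and everything is reloaded from the persistent store on restart.
    public class MinNodesValidator implements TopologyValidator {
        private static final long serialVersionUID = 1L;

        private final int minNodes;

        public MinNodesValidator(int minNodes) {
            this.minNodes = minNodes;
        }

        @Override
        public boolean validate(Collection<ClusterNode> nodes) {
            return nodes.size() >= minNodes;
        }
    }

It gets wired in with ccfg.setTopologyValidator(new MinNodesValidator(...)) on the same cache configuration as above.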
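And to make #1 and #2 concrete, here is roughly the flow I'm after, written as if it could run after the topology has settled (Trade and "trades" match the config sketch above; reloadFromStore() is a hypothetical, application-specific helper):

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    void recoverLostPartitions(Ignite ignite) {
        IgniteCache<Long, Trade> cache = ignite.cache("trades");

        // #1: partitions whose primary AND all in-cache backups are gone.
        Collection<Integer> lost = cache.lostPartitions();

        if (!lost.isEmpty()) {
            // #2: repopulate the affected keys from the write-through store.
            // (reloadFromStore is a hypothetical, application-specific helper.)
            reloadFromStore(ignite, lost);

            // Clear the lost-partition state so normal reads/writes resume.
            ignite.resetLostPartitions(Collections.singleton("trades"));
        }
    }

My understanding is that lostPartitions() only reports anything if the cache has a PartitionLossPolicy such as READ_WRITE_SAFE configured; either way, the real obstacle is that I can't see where to run this safely while the validator is still deciding whether the topology is valid.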
Thanks in advance, Cody