@DaveHarvey, I'll look at that tomorrow. Seems potentially complicated, but if that's what has to happen we'll figure it out.
Interestingly, cutting the cluster to half as many nodes (by reducing the number of backups) seems to have resolved the issue. Is there a guideline for how large a cluster should be? We were consistently hitting the issue with a single 44-node cluster configured with 3 data backups (4 total copies). I switched to two separate 22-node clusters, each with 1 data backup (2 total copies). The smaller clusters have worked perfectly every time so far, though I haven't exercised them as heavily.

@smovva - We're still actively experimenting with instance and cluster sizing. We were running on c4.4xl instances, but we were barely using the CPUs while consistently hitting memory issues (with a 20GB heap plus a bit of off-heap). We just switched to r4.2xl instances, which are working better for us so far and are a bit cheaper. That said, I would imagine the optimal size depends on your use case - it's basically a tradeoff between the memory, CPU, networking, and operational cost requirements of your workload.
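In case it helps anyone compare setups: the backup count is configured per cache. A minimal Spring XML sketch of the 1-backup (2 total copies) configuration - the cache name here is just a placeholder, not from our actual config:

```xml
<!-- Hypothetical cache definition; only the backups value reflects the setup described above. -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Placeholder cache name -->
    <property name="name" value="myCache"/>
    <!-- 1 backup = 2 total copies of each partition -->
    <property name="backups" value="1"/>
    <property name="cacheMode" value="PARTITIONED"/>
</bean>
```

Dropping `backups` from 3 to 1 is what let us halve the node count while keeping the same primary data footprint.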
