Hey guys, I'm troubleshooting some performance issues with our cluster when scaling it under production load.
If we add a new drillbit to the cluster, performance degrades severely as soon as it joins (queries that usually take 1s take 60s, for example). After a few minutes it recovers and everything is back to normal. My assumption is that the new drillbit is still initializing or "warming up" but has already made itself available to take work, so queries end up waiting for it to finish initializing before they return. I haven't confirmed this in the query profiles yet (we have a fair bit of load, so I haven't isolated the individual long-running queries), but I'll keep investigating.

In the meantime, does that theory sound plausible? If so, what initialization or warm-up is the drillbit doing? And could we delay it joining the cluster for active work until it is fully ready to take that work on? We're considering some form of autoscaling to handle varying load, so this is really crucial for us! Any thoughts or pointers in the right direction would be great.
