HBase appears to take 30 minutes or so to recover when the regionserver holding the ROOT role fails. What options are available to recover more quickly from such a failure? When this happens, our applications and SLAs are impacted.
It would also be good to recover quickly from a failure of the regionserver that owns the .META. table. During HBase startup, a random server is elected to manage each of the ROOT and .META. tables (a different server for each). Each of these servers is a single point of failure. At the very least, perhaps there is a way to force which server is selected for each role, even if only via startup order. We could then assign a server that doesn't participate in flow tasks (no tasktracker), and so would be more stable. There may also be a config option for this. Is there a way to force election of a new ROOT/.META. owner within a minute or so, instead of 30+ minutes?
