Inline below as well.

On Fri, Feb 10, 2017 at 1:23 PM, Weber, Richard <[email protected]> wrote:
>> On Fri, Feb 10, 2017 at 10:32 AM, Weber, Richard <[email protected]> wrote:
>>> I definitely would push for prioritization on this.
>>>
>>> Our main use case is less about multiple racks and failure, and more
>>> about functionality during the install process. Our clusters are
>>> installed in logical regions, and we install 1/3 of a region at a time.
>>> That means 1/3 of the cluster can be down for the SW install, reboot, or
>>> something else. Allowing rack locality to be logically defined will
>>> allow the data to still be available during normal maintenance
>>> operations.
>>
>> That's an interesting use case. How long is the 1/3rd of the cluster
>> typically down for? I'd be afraid that, if it's down for more than a
>> couple of minutes, there's a decent chance of losing one server in the
>> other 2/3 region, which would leave a tablet at 1/3 replication and
>> unavailable for writes or consistent reads. Is that acceptable for your
>> target use cases?
>
> Nodes would be down typically for 5-15 minutes or so. Are you saying that
> if 1 node goes down, there's an increased chance of one of the other 2
> going down as well?

Not that it increases the chances of the other two going down, but it does
increase the impact.

> That doesn't sound good if losing a node increases the instability of the
> system. Additionally, wouldn't the tablets start re-replicating the data
> if 2/3 of the nodes detect the node is down for too long?

Yep - the default setting is 5 minutes, IIRC.

> How does the system typically handle a node failing? Is re-replication of
> data not automatic? (I haven't experimented with this enough)

Right - after 5 minutes, the leader replica of a tablet will decide that a
node is dead, and evict it. The master will then notice that the tablet is
under-replicated and make a new replica. There's a design we're working on
to make it so that, instead of evicting the presumed-dead replica, it would
recruit the new 4th replica first and get it online.
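[Editor's note: a minimal sketch, not Kudu source code, of the Raft-style majority arithmetic behind the "increased impact" point above: with 3 voting replicas, a tablet stays writable with one node down, but a second failure before the first node returns drops it below a majority.]

```python
# Editorial illustration only - the function name and structure are made up
# for this sketch; only the 3-replica majority behavior comes from the thread.

def has_majority(live_replicas: int, total_replicas: int = 3) -> bool:
    """A configuration of `total_replicas` voters needs a strict majority
    of replicas alive to elect a leader and serve writes or consistent
    reads."""
    return live_replicas > total_replicas // 2

# One node down for a 5-15 minute install window: 2 of 3 alive, still available.
assert has_majority(2)

# A second failure during that window: 1 of 3 alive, so the tablet is
# unavailable for writes or consistent reads until a node returns or
# re-replication (after the ~5 minute eviction timeout) restores a majority.
assert not has_majority(1)
```

This is only the availability arithmetic; the eviction and re-replication mechanics are as described above.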
That way, if the "dead" one comes back, it can rejoin transparently without
having to wait for the full new copy to be made.

> Our install process is along the lines of:
>
> 1) copy software to target machine
> 2) shut down services on machine
> 3) expand software to final location
> 4) reboot (if new kernel)
> 5) restart services.

OK, hopefully that usually happens quickly. I've seen other orgs try this
and have issues where their restart ends up running fsck on 12x4TB drives
and the restart takes an hour, though :)

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
