HBase stores HFiles (and other files) in HDFS, so HDFS replication should take care of the lost replica. The region server may indeed end up reading those files from a remote machine; if it keeps running, however, compaction will eventually rewrite the files and restore locality.
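To make the locality point concrete, here is a minimal sketch (plain Python, not HBase code) of a locality index: the fraction of a region's HFile bytes that have a replica on the region server's own host. The block layout, the host names (`rs1`, `dn2`, ...) and the `locality_index` helper are hypothetical, and the "after" layout assumes the default HDFS placement policy, which puts the first replica of a newly written block on the writer's node when that node is also a DataNode.

```python
def locality_index(blocks, server):
    """Return the fraction of bytes that have a replica on `server`.

    `blocks` is a list of (size_in_mb, set_of_hosts_holding_a_replica).
    """
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 1.0  # nothing to read, so nothing is remote
    local = sum(size for size, hosts in blocks if server in hosts)
    return local / total


# Hypothetical layout after a disk failure on rs1: the second block lost
# its local replica and HDFS re-replicated it onto another node.
before = [
    (128, {"rs1", "dn2", "dn3"}),
    (128, {"dn2", "dn3", "dn4"}),  # now remote-only from rs1's point of view
    (64, {"rs1", "dn3", "dn4"}),
]

# Assumed layout after compaction: rs1 rewrote the files, so each new
# block has its first replica on rs1 again.
after = [(size, {"rs1", "dn2", "dn3"}) for size, _ in before]

print(locality_index(before, "rs1"))
print(locality_index(after, "rs1"))
```

Reads of the second block go over the network until a compaction rewrites it, which is why the slowdown is limited to part of the data and is temporary.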
In case of a full node failure, ideally there should not be any downtime; some requests may just take longer as they retry through the period when the node is down. Recovery can often be very fast. Take a look at HBASE-5843, which summarizes the MTTR (mean time to recover) improvement work done recently. There's also HBASE-7006 and some related JIRAs that may allow us to serve the region sooner after recovery.

On Mon, Jun 10, 2013 at 6:42 PM, Lucas Stanley <[email protected]> wrote:
> Hi,
>
> I'm trying to understand how failures are handled in HBase.
>
> One Disk Failure:
> If one disk on a Region Server fails and some HFiles are lost on that
> machine, how will that Region Server handle incoming reads for the missing
> data? Will the HRegion read from a remote node's replicated HFile over the
> network? Will this cause the reads to be slow for this particular set of
> data?
>
>
> Full node failure:
> Also, if a Region Server completely crashes/panics, will some reads fail for
> a few minutes? If that crashed Region Server was hosting 5 regions, I guess
> it will take some time for other nodes to take over those regions and
> replay the WAL. So, can I expect a few minutes of downtime before I can
> read from the crashed regions again?
>
