Lucas: You can also find some interesting discussion in HBASE-8701 where we try to handle the case where concurrent writes to the region server carry the same timestamp as some of the Puts that are being replayed.
Cheers

On Mon, Jun 10, 2013 at 7:18 PM, Sergey Shelukhin <[email protected]> wrote:

> HBase stores HFiles (and other files) in HDFS, so HDFS replication should
> take care of the lost replica. It may indeed happen that the region server
> will be reading the files from a remote machine; if it continues
> functioning, however, compaction of the files will eventually restore
> locality.
>
> In case of full failure, ideally there should not be any downtime; some
> requests can just take long as they retry through the downtime of one node.
> Often, recovery can be very fast.
> Take a look at HBASE-5843; it has a summary of the MTTR (mean time to
> recover) improvement work done recently.
> There's also HBASE-7006 and some related JIRAs that may allow us to serve
> the region faster after recovery.
>
> On Mon, Jun 10, 2013 at 6:42 PM, Lucas Stanley <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'm trying to understand how failures are handled in HBase.
> >
> > One Disk Failure:
> > If one disk on a Region Server fails and some HFiles are lost on that
> > machine, how will that Region Server handle incoming reads for the
> > missing data? Will the HRegion read from a remote node's replicated
> > HFile over the network? Will this cause the reads to be slow for this
> > particular set of data?
> >
> > Full node failure:
> > Also, if a Region Server completely crashes/panics, will some reads fail
> > for a few minutes? If that crashed Region Server was hosting 5 regions, I
> > guess it will take some time for other nodes to take over those regions
> > and replay the WAL. So, can I expect a few minutes of downtime before I
> > can read from the crashed regions again?
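
To make the point about requests that "just take long as they retry" a bit more concrete, here is a minimal sketch of a retry loop with a doubling pause. It is not the HBase client's actual retry code (the real client retries internally, governed by settings such as hbase.client.retries.number and hbase.client.pause); the class RetrySketch, the method callWithRetries and the simulated "region not online yet" failure are all hypothetical names made up for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only: a generic retry loop, not HBase client code.
final class RetrySketch {

    /**
     * Calls op until it succeeds or maxAttempts is used up, sleeping between
     * attempts and doubling the pause each time. Rethrows the last failure.
     */
    static <T> T callWithRetries(Callable<T> op, int maxAttempts, long initialPauseMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long pauseMs = initialPauseMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMs); // wait for the region to come back
                    pauseMs *= 2;          // back off before the next attempt
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated read: fails twice (region still being reassigned), then succeeds.
        String value = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("region not online yet");
            }
            return "row-value";
        }, 5, 10L);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

From the caller's side the read does not fail; it simply returns later, once the region is being served again, as long as recovery finishes before the attempts run out.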
