Nice explanation, Anoop. This deserves to be part of the HBase wiki.

On Wed, Aug 22, 2012 at 5:34 AM, Anoop Sam John <[email protected]> wrote:
> >> I could be wrong. I think HFile index block (which is located at the end
> >> of HFile) is a binary search tree containing all row-key values (of the
> >> HFile) in the binary search tree. Searching a specific row-key in the
> >> binary search tree could easily find whether a row-key exists (some node
> >> in the tree has the same row-key value) or not. Why we need load every
> >> block to find if the row exists?
>
> I think there is some confusion regarding the blooms and the block index.
> I will try to clarify this point.
>
> A block index is present in every HFile. Within an HFile the data is
> written as multiple blocks, and HBase reads data from the HDFS layer only
> block by block. The block index contains information about the blocks
> within that HFile: the start and end rowkeys that reside in each block,
> plus block details such as its offset and length. When a request comes in
> to get a rowkey 'x', all the HFiles within that region need to be checked
> [the KV can be present in any of the HFiles]. To find out which block
> within an HFile may contain this row, the block index is used. This block
> index is always in memory. The lookup tells us only the possible block in
> which the row may be present. HBase will load that block and read through
> it to get the row we are interested in.
>
> The bloom has information about each and every row added to that HFile
> [the block index won't have info about each and every row]. This bloom
> information is always in memory. So when a read request to get row 'x'
> from an HFile comes in, first the bloom is checked to see whether this row
> is in this file or not. If it is not there, as per the bloom, no block at
> all will be fetched. But if bloom is not enabled, we might find one block
> whose row range contains 'x', and HBase will load that block. So using
> blooms can avoid this IO. Hope this is clear for you now.
>
> -Anoop-
> ________________________________________
> From: Lin Ma [[email protected]]
> Sent: Wednesday, August 22, 2012 5:41 PM
> To: J Mohamed Zahoor; [email protected]
> Subject: Re: Using HBase serving to replace memcached
>
> Thanks Zahoor,
>
> I read through the document you referred to. I am confused about what
> leaf-level index, intermediate-level index and root-level index mean. I
> would appreciate it if you could give more details on what they are, or
> point me to the related documents.
>
> BTW: the document you pointed me to is very good, however I am missing
> some basic background on the 3 terms I mentioned above. :-)
>
> regards,
> Lin
>
> On Wed, Aug 22, 2012 at 12:51 PM, J Mohamed Zahoor <[email protected]>
> wrote:
>
> >> I could be wrong. I think HFile index block (which is located at the end
> >> of HFile) is a binary search tree containing all row-key values (of the
> >> HFile) in the binary search tree. Searching a specific row-key in the
> >> binary search tree could easily find whether a row-key exists (some node
> >> in the tree has the same row-key value) or not. Why we need load every
> >> block to find if the row exists?
> >>
> >
> > Hmm...
> > It is a multilevel index. Only the root indexes (Data, Meta etc.) are
> > loaded when a region is opened. The rest of the tree (intermediate and
> > leaf indexes) is present at the block level.
> > I am assuming an HFile v2 here for the discussion.
> > Read this for more clarity: http://hbase.apache.org/book/apes03.html
> >
> > Nice discussion. You made me read a lot of things. :-)
> > Now I will dig into the code and check this out.
> >
> > ./Zahoor
> >

--
Thanks & Regards,
Anil Gupta
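
P.S. To make Anoop's explanation quoted above concrete, here is a minimal toy sketch in Python. It is not HBase code: `ToyBloom`, `ToyHFile`, `rows_per_block` and `blocks_loaded` are hypothetical names invented for illustration, and the index simply stores the start and end rowkey of each block as described in the thread. The point is only to show that, for a missing row whose key falls inside some block's range, the block index alone still costs one block read, while a bloom check in front of it can skip the read.

```python
import bisect
import hashlib


class ToyBloom:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class ToyHFile:
    """Toy stand-in for an HFile: sorted rows split into fixed-size blocks."""

    def __init__(self, rows, rows_per_block=2, use_bloom=True):
        items = sorted(rows.items())
        self.blocks = [dict(items[i:i + rows_per_block])
                       for i in range(0, len(items), rows_per_block)]
        # In-memory block index: (start rowkey, end rowkey) per block.
        self.index = [(min(block), max(block)) for block in self.blocks]
        self.start_keys = [start for start, _ in self.index]
        self.bloom = None
        if use_bloom:
            self.bloom = ToyBloom()
            for key in rows:
                self.bloom.add(key)
        self.blocks_loaded = 0  # counts simulated block reads

    def get(self, row):
        # 1. Bloom check: a "no" means the row is definitely not in this file.
        if self.bloom is not None and not self.bloom.might_contain(row):
            return None
        # 2. Block index: find the only block whose key range can hold the row.
        i = bisect.bisect_right(self.start_keys, row) - 1
        if i < 0 or row > self.index[i][1]:
            return None  # row lies outside every block's range
        # 3. Load that block (the expensive step) and look inside it.
        self.blocks_loaded += 1
        return self.blocks[i].get(row)


if __name__ == "__main__":
    rows = {"a": 1, "c": 2, "e": 3, "g": 4}  # blocks: [a, c] and [e, g]
    plain = ToyHFile(rows, use_bloom=False)
    bloomed = ToyHFile(rows, use_bloom=True)
    # 'b' is absent but falls inside the range of the first block.
    print(plain.get("b"), plain.blocks_loaded)      # one block read, no hit
    print(bloomed.get("b"), bloomed.blocks_loaded)  # no read unless the bloom
                                                    # gives a false positive
```

Note that the bloom can only say "definitely not here" or "maybe here", so a "maybe" still goes through the block index and a block read, exactly as in the case without blooms.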
