Abhishek,

Setting your column family's bloom filter to ROWCOL will include qualifiers:
http://hbase.apache.org/book.html#schema.bloom
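For example, with the Java client of that era it looks roughly like this (a sketch only: the table and family names are made up, and BloomType lived under StoreFile in 0.92/0.94 before moving to org.apache.hadoop.hbase.regionserver in later releases):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class EnableRowColBloom {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor table = new HTableDescriptor("wide_table"); // hypothetical table name
        HColumnDescriptor cf = new HColumnDescriptor("d");           // hypothetical family name
        cf.setBloomFilterType(StoreFile.BloomType.ROWCOL); // bloom keys on row+qualifier
        cf.setBlocksize(8 * 1024); // smaller blocks give the block index finer granularity
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }

Keep in mind a ROWCOL bloom holds one entry per row+qualifier rather than per row, so for very wide rows the filter itself gets large.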
-Jason

On Wed, Aug 22, 2012 at 1:49 PM, Pamecha, Abhishek <[email protected]> wrote:
> Can I enable bloom filters per block at the column qualifier level too?
> That way, with small block sizes, I can selectively load only a few data
> blocks into memory and trade off block size against bloom filter false
> positive rate.
>
> I am designing for a wide-table scenario with thousands to millions of
> columns, and so I don't really want to pay for checks on blocks holding
> more than one row key.
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: Mohit Anchlia [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 11:09 AM
> To: [email protected]
> Subject: Re: HBase Put
>
> On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <[email protected]>
> wrote:
>
> > So a GET query means one needs to look in every HFile where the key
> > falls within the file's min/max key range.
> >
> > From another parallel thread, I gather that an HFile comprises blocks,
> > which, I think, are the atomic unit of persisted data in HDFS (please
> > correct me if not), and that each block in an HFile covers a range of
> > keys. My key can fall within a block's range and yet not be present in
> > it, so every block whose range admits the key has to be checked.
> >
> > There is one block index per HFile which orders blocks by key range.
> > This index helps reduce the number of blocks to scan by picking out
> > only those blocks whose ranges could contain the key.
> >
> > In this case, if puts arrive in random key order, every block may end
> > up covering a similar range, and HBase may need to scan every block in
> > the file. That cannot be good for performance.
> >
> > I just want to validate my understanding.
>
> If you have such a use case, I think the best practice is to use bloom
> filters. I think in general it's a good idea to at least enable the
> bloom filter at the row level.
>
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:[email protected]]
> > Sent: Tuesday, August 21, 2012 5:55 PM
> > To: [email protected]
> > Subject: Re: HBase Put
> >
> > That is correct.
> >
> >
> > ________________________________
> > From: "Pamecha, Abhishek" <[email protected]>
> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> > Sent: Tuesday, August 21, 2012 4:45 PM
> > Subject: RE: HBase Put
> >
> > Hi Lars,
> >
> > Thanks for the explanation. I still have a little doubt:
> >
> > Based on your description, since gets do a merge sort, the data on
> > disk is not kept sorted across files, only within each file.
> >
> > So if, on two separate days, these keys get inserted:
> >
> > Day 1: File1: A B J M
> > Day 2: File2: C D K P
> >
> > then each file is sorted within itself, but scanning both files
> > requires HBase to merge-sort them to produce a sorted result. Right?
> >
> > Also, File1 and File2 are immutable, and during compactions File1 and
> > File2 are merge-sorted into a bigger File3. Is that correct too?
> >
> > Thanks,
> > Abhishek
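The merge Abhishek describes is easy to picture with his Day 1/Day 2 files. A toy Java sketch of the idea (nothing HBase-specific, just a two-way merge over sorted, immutable "files"; compaction performs the same merge but writes the result out as a new file):

    import java.util.Arrays;
    import java.util.List;

    public class MergeRead {
      public static void main(String[] args) {
        List<String> file1 = Arrays.asList("A", "B", "J", "M"); // flushed day 1, sorted
        List<String> file2 = Arrays.asList("C", "D", "K", "P"); // flushed day 2, sorted

        // Two cursors; always emit the smaller head. A scan sees one sorted
        // stream even though neither file is ever rewritten in place.
        int i = 0, j = 0;
        StringBuilder out = new StringBuilder();
        while (i < file1.size() || j < file2.size()) {
          boolean takeFirst = j >= file2.size()
              || (i < file1.size() && file1.get(i).compareTo(file2.get(j)) <= 0);
          out.append(takeFirst ? file1.get(i++) : file2.get(j++)).append(' ');
        }
        System.out.println(out); // prints: A B C D J K M P
      }
    }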
> > -----Original Message-----
> > From: lars hofhansl [mailto:[email protected]]
> > Sent: Tuesday, August 21, 2012 4:07 PM
> > To: [email protected]
> > Subject: Re: HBase Put
> >
> > In a nutshell:
> > - Puts are collected in memory (in a sorted data structure)
> > - When the collected data reaches a certain size it is flushed to a
> >   new file (which is sorted)
> > - Gets do a merge sort between the various files that have been created
> > - To limit the number of files, they are periodically compacted into
> >   fewer, larger files
> >
> > So the data files (HFiles) are immutable once written; changes are
> > batched in memory first.
> >
> > -- Lars
> >
> >
> > ________________________________
> > From: "Pamecha, Abhishek" <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Tuesday, August 21, 2012 4:00 PM
> > Subject: HBase Put
> >
> > Hi,
> >
> > I had a question on the HBase Put call. When data is inserted without
> > any order to the column qualifiers, how does HBase maintain sortedness
> > with respect to column qualifiers in its store files/blocks?
> >
> > I checked the code base and I can see checks
> > <https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> > being made that key-value pairs are appended in lexicographic order,
> > but I can't seem to find where the key offset is computed in the first
> > place.
> >
> > Also, given that HDFS is append-only by nature, how do randomly
> > ordered keys make their way into sorted order? Is it only during
> > minor/major compactions that this sortedness gets applied, and is
> > there a small window during which data is not sorted?
> >
> > Thanks,
> > Abhishek
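The "sorted in memory first, immutable on disk" part of Lars's answer is what resolves the append-only puzzle: files are born sorted because the in-memory buffer is sorted, so HDFS never has to rewrite anything in place. A toy stand-in for the idea (not HBase's actual MemStore, just the shape of it):

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class ToyMemStore {
      // Puts land in a sorted map no matter what order they arrive in.
      private final ConcurrentSkipListMap<String, String> memstore =
          new ConcurrentSkipListMap<String, String>();

      void put(String key, String value) {
        memstore.put(key, value); // O(log n); the map stays sorted at all times
      }

      // "Flush": walk the map in ascending key order, so the resulting file
      // is written sorted in a single append-only pass.
      String flush() {
        StringBuilder hfile = new StringBuilder();
        for (Map.Entry<String, String> e : memstore.entrySet()) {
          hfile.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        memstore.clear(); // the flushed "file" is immutable; new puts start fresh
        return hfile.toString();
      }

      public static void main(String[] args) {
        ToyMemStore m = new ToyMemStore();
        m.put("rowA:qualZ", "1"); // arrives first
        m.put("rowA:qualB", "2"); // arrives later but sorts earlier
        System.out.print(m.flush()); // rowA:qualB=2 then rowA:qualZ=1
      }
    }

So data is never unsorted on disk; the only "window" is the in-memory buffer, and that is kept sorted as well.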
