Abhishek,

Setting your column family's bloom filter to ROWCOL will include qualifiers:
http://hbase.apache.org/book.html#schema.bloom
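For example, with the Java client of that era it looks roughly like this (a sketch only: the table and family names are made up, and BloomType lived under StoreFile in 0.92/0.94 before moving to org.apache.hadoop.hbase.regionserver in later releases):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class EnableRowColBloom {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor table = new HTableDescriptor("wide_table"); // hypothetical table name
        HColumnDescriptor cf = new HColumnDescriptor("d");           // hypothetical family name
        cf.setBloomFilterType(StoreFile.BloomType.ROWCOL); // bloom keys on row+qualifier
        cf.setBlocksize(8 * 1024); // smaller blocks give the block index finer granularity
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }

Keep in mind a ROWCOL bloom holds one entry per row+qualifier rather than per row, so for very wide rows the filter itself gets large.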
-Jason

On Wed, Aug 22, 2012 at 1:49 PM, Pamecha, Abhishek <[email protected]> wrote:
> Can I enable bloom filters per block at the column qualifier level too?
> That way, with small block sizes, I can selectively load only a few data
> blocks into memory and trade off block size against bloom filter false
> positive rate.
>
> I am designing for a wide-table scenario with thousands to millions of
> columns, and so I don't really want to pay for checks on blocks holding
> more than one row key.
>
> Thanks,
> Abhishek
>
>
> -----Original Message-----
> From: Mohit Anchlia [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 11:09 AM
> To: [email protected]
> Subject: Re: HBase Put
>
> On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <[email protected]>
> wrote:
>
> > So a GET query means one needs to look in every HFile where the key
> > falls within the file's min/max key range.
> >
> > From another parallel thread, I gather that an HFile comprises blocks,
> > which, I think, are the atomic unit of persisted data in HDFS (please
> > correct me if not), and that each block in an HFile covers a range of
> > keys. My key can fall within a block's range and yet not be present in
> > it, so every block whose range admits the key has to be checked.
> >
> > There is one block index per HFile which orders blocks by key range.
> > This index helps reduce the number of blocks to scan by picking out
> > only those blocks whose ranges could contain the key.
> >
> > In this case, if puts arrive in random key order, every block may end
> > up covering a similar range, and HBase may need to scan every block in
> > the file. That cannot be good for performance.
> >
> > I just want to validate my understanding.
>
> If you have such a use case, I think the best practice is to use bloom
> filters. I think in general it's a good idea to at least enable the
> bloom filter at the row level.
>
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:[email protected]]
> > Sent: Tuesday, August 21, 2012 5:55 PM
> > To: [email protected]
> > Subject: Re: HBase Put
> >
> > That is correct.
> >
> >
> > ________________________________
> > From: "Pamecha, Abhishek" <[email protected]>
> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> > Sent: Tuesday, August 21, 2012 4:45 PM
> > Subject: RE: HBase Put
> >
> > Hi Lars,
> >
> > Thanks for the explanation. I still have a little doubt:
> >
> > Based on your description, since gets do a merge sort, the data on
> > disk is not kept sorted across files, only within each file.
> >
> > So if, on two separate days, these keys get inserted:
> >
> > Day 1: File1: A B J M
> > Day 2: File2: C D K P
> >
> > then each file is sorted within itself, but scanning both files
> > requires HBase to merge-sort them to produce a sorted result. Right?
> >
> > Also, File1 and File2 are immutable, and during compactions File1 and
> > File2 are merge-sorted into a bigger File3. Is that correct too?
> >
> > Thanks,
> > Abhishek
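The merge Abhishek describes is easy to picture with his Day 1/Day 2 files. A toy Java sketch of the idea (nothing HBase-specific, just a two-way merge over sorted, immutable "files"; compaction performs the same merge but writes the result out as a new file):

    import java.util.Arrays;
    import java.util.List;

    public class MergeRead {
      public static void main(String[] args) {
        List<String> file1 = Arrays.asList("A", "B", "J", "M"); // flushed day 1, sorted
        List<String> file2 = Arrays.asList("C", "D", "K", "P"); // flushed day 2, sorted

        // Two cursors; always emit the smaller head. A scan sees one sorted
        // stream even though neither file is ever rewritten in place.
        int i = 0, j = 0;
        StringBuilder out = new StringBuilder();
        while (i < file1.size() || j < file2.size()) {
          boolean takeFirst = j >= file2.size()
              || (i < file1.size() && file1.get(i).compareTo(file2.get(j)) <= 0);
          out.append(takeFirst ? file1.get(i++) : file2.get(j++)).append(' ');
        }
        System.out.println(out); // prints: A B C D J K M P
      }
    }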
> > -----Original Message-----
> > From: lars hofhansl [mailto:[email protected]]
> > Sent: Tuesday, August 21, 2012 4:07 PM
> > To: [email protected]
> > Subject: Re: HBase Put
> >
> > In a nutshell:
> > - Puts are collected in memory (in a sorted data structure)
> > - When the collected data reaches a certain size it is flushed to a
> >   new file (which is sorted)
> > - Gets do a merge sort between the various files that have been created
> > - To limit the number of files, they are periodically compacted into
> >   fewer, larger files
> >
> > So the data files (HFiles) are immutable once written; changes are
> > batched in memory first.
> >
> > -- Lars
> >
> >
> > ________________________________
> > From: "Pamecha, Abhishek" <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Tuesday, August 21, 2012 4:00 PM
> > Subject: HBase Put
> >
> > Hi,
> >
> > I had a question on the HBase Put call. When data is inserted without
> > any order to the column qualifiers, how does HBase maintain sortedness
> > with respect to column qualifiers in its store files/blocks?
> >
> > I checked the code base and I can see checks
> > <https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java#L319>
> > being made that key-value pairs are appended in lexicographic order,
> > but I can't seem to find where the key offset is computed in the first
> > place.
> >
> > Also, given that HDFS is append-only by nature, how do randomly
> > ordered keys make their way into sorted order? Is it only during
> > minor/major compactions that this sortedness gets applied, and is
> > there a small window during which data is not sorted?
> >
> > Thanks,
> > Abhishek
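The "sorted in memory first, immutable on disk" part of Lars's answer is what resolves the append-only puzzle: files are born sorted because the in-memory buffer is sorted, so HDFS never has to rewrite anything in place. A toy stand-in for the idea (not HBase's actual MemStore, just the shape of it):

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class ToyMemStore {
      // Puts land in a sorted map no matter what order they arrive in.
      private final ConcurrentSkipListMap<String, String> memstore =
          new ConcurrentSkipListMap<String, String>();

      void put(String key, String value) {
        memstore.put(key, value); // O(log n); the map stays sorted at all times
      }

      // "Flush": walk the map in ascending key order, so the resulting file
      // is written sorted in a single append-only pass.
      String flush() {
        StringBuilder hfile = new StringBuilder();
        for (Map.Entry<String, String> e : memstore.entrySet()) {
          hfile.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        memstore.clear(); // the flushed "file" is immutable; new puts start fresh
        return hfile.toString();
      }

      public static void main(String[] args) {
        ToyMemStore m = new ToyMemStore();
        m.put("rowA:qualZ", "1"); // arrives first
        m.put("rowA:qualB", "2"); // arrives later but sorts earlier
        System.out.print(m.flush()); // rowA:qualB=2 then rowA:qualZ=1
      }
    }

So data is never unsorted on disk; the only "window" is the in-memory buffer, and that is kept sorted as well.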
