Re: Disk Seeks and Column families

Praveen Sripati Mon, 23 Jan 2012 22:15:54 -0800

Thanks for the response. I am just getting started with HBase, and before
getting into the code/API level details, I am trying to understand the
problem area HBase is trying to address through its architecture/design.


1) So, what are the recommendations for a table with many columns and dense
data? Is HBase not the right tool?

2) Also, if the data for a column is spread wide across blocks, and maybe
even across nodes, how will HBase help in aggregation?

3) Also, about storing data using an incremental row key: initially there
will be a hot spot, with all the data going to a single region. Even after
the region splits into two, the first one won't be getting any data (with an
incremental row key) and the second one will be hammered.

One approach to alleviate this is not to insert incremental row keys from
the client, and instead to scatter the row keys for better load balancing.
But this approach is not efficient if I want to get events in time
sequence, in which case I have to use some filters to scan the entire data.

4) It is still not clear to me why I can't have 10 column families in HBase,
and why only 2 or 3 are recommended according to this link (1).

(1) - http://hbase.apache.org/book/number.of.cfs.html
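To make the trade-off in point 3 concrete, here is a minimal sketch of the key-scattering idea, assuming a small fixed number of "buckets" used as a row key prefix. It is a plain in-memory model and does not call any HBase API. The names `NUM_BUCKETS`, `salted_key`, `scan_ranges` and `time_ordered_scan` are made up for illustration, and the key layout (bucket, then zero-padded timestamp, then event id) is just one possible choice. With such a prefix, a time-range read becomes one range scan per bucket followed by a merge, rather than a filtered scan over the entire data.

```python
import bisect
import hashlib
import heapq

# Hypothetical bucket count; in practice it would be tuned to the number of
# regions the writes should be spread over.
NUM_BUCKETS = 4


def salted_key(event_id: str, timestamp_ms: int) -> bytes:
    """Build a row key of the form b'BB-TTTTTTTTTTTTT-event_id'."""
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    # Zero padding keeps lexicographic byte order equal to numeric order.
    return f"{bucket:02d}-{timestamp_ms:013d}-{event_id}".encode("utf-8")


def scan_ranges(start_ms: int, stop_ms: int):
    """Return one (start_row, stop_row) pair per bucket for [start_ms, stop_ms)."""
    return [
        (
            f"{b:02d}-{start_ms:013d}".encode("utf-8"),
            f"{b:02d}-{stop_ms:013d}".encode("utf-8"),
        )
        for b in range(NUM_BUCKETS)
    ]


def time_ordered_scan(sorted_keys, start_ms: int, stop_ms: int):
    """Simulate one range scan per bucket over a sorted key list, then merge."""
    per_bucket = []
    for start_row, stop_row in scan_ranges(start_ms, stop_ms):
        lo = bisect.bisect_left(sorted_keys, start_row)
        hi = bisect.bisect_left(sorted_keys, stop_row)
        per_bucket.append(sorted_keys[lo:hi])
    # Each slice is already time ordered; merge on the part after 'BB-'.
    return list(heapq.merge(*per_bucket, key=lambda k: k[3:]))


if __name__ == "__main__":
    events = [(f"evt{i}", 1000 + i) for i in range(20)]
    table = sorted(salted_key(e, t) for e, t in events)  # stands in for the table
    for key in time_ordered_scan(table, 1005, 1015):
        print(key.decode("utf-8"))
```

The cost is that every time-range read issues `NUM_BUCKETS` range scans instead of one, so the bucket count is a trade-off between write distribution and read fan-out.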

Praveen

On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <[email protected]> wrote:

> Praveen,
>
> Basically you are correct on all counts. If there are too many columns,
> HBase will have to issue more disk seeks to extract only the particular
> columns you need ... and since the data is laid out horizontally there are
> fewer common substrings in a single HBase block and compression quality
> starts to degrade due to reduced redundancy.
>
>
> On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
> <[email protected]> wrote:
>
> > Thanks for the response.
> >
> > > The contents of a row stay together like a regular row-oriented database.
> >
> > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> >
> > Is the above statement true for an HFile?
> >
> > Also from the above example, the data for the column family qualifier are
> > not adjacent to take advantage of compression (
> > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this a
> > proper statement?
> >
> > Regards,
> > Praveen
> >
> > On Sat, Jan 21, 2012 at 9:03 PM, <[email protected]> wrote:
> >
> > > Have you considered using AggregationProtocol to perform aggregation ?
> > >
> > > Thanks
> > >
> > >
> > >
> > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > 1) According to this URL (1), HBase performs well for two or three
> > > > column families. Why is that so?
> > > >
> > > > 2) A dump of an HFile looks like the one below. The contents of a row
> > > > stay together like a regular row-oriented database. If the column
> > > > family has 100 column family qualifiers and is dense, then the data
> > > > for a particular column family qualifier is spread wide. If I want to
> > > > do an aggregation on a particular column identifier, the disk seeks
> > > > don't seem to be much better than in a regular row-oriented database.
> > > >
> > > > Please correct me if I am wrong.
> > > >
> > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > > >
> > > > (1) - http://hbase.apache.org/book/number.of.cfs.html
> > > >
> > > > Thanks,
> > > > Praveen
> > >
> >
>
