Inline.

J-D
On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas <[email protected]> wrote:
> Hello,
> ...
>
> While it is an option to process the entire HBase table, e.g. every night
> once we go live, it probably isn't an option as data volume grows over
> the years. So, what options are there for some kind of incremental
> aggregation of only new data?

Yeah, you don't want to go there.

> - Perhaps using versioning (internal timestamp) might be an option?

I guess you could do rollups and ditch the raw data, if you don't need it.

> - Perhaps having some kind of (daily) HBase staging table which is
> truncated after aggregating data is an option?

If you do the aggregations nightly, then you won't have "access to
aggregated data very quickly".

> - How could coprocessors help here (at the time of the go-live, they
> might be available in e.g. Cloudera)?

Coprocessors are more of an internal HBase tool, so don't put all your
eggs there until you've played with them. What you could do is get the
0.92.0 RC0 tarball and try them out :)

> Any ideas/comments are appreciated.

Normally data is stored in a way that's not easy to query in a batch or
analytics mode, so an ETL step is introduced. You'll probably need to do
the same: you could asynchronously stream your data to other HBase
tables, or to Hive or Pig, via logs or replication, and then either
insert it directly in the format it needs to be in or stage it for later
aggregations. If you explore those avenues, I'm sure you'll find
concepts very similar to the RDBMS ones you listed.

You could also keep live counts using atomic increments; you'd issue
those at write time or asynchronously.

Hope this helps,

J-D
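The staging-table idea discussed above can be sketched in plain Java. This is only an illustration of the pattern, not HBase code: in a real setup the staging buffer and the aggregate store would each be an HBase table, and the nightly job would scan one and write the other. All names here are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the daily staging pattern: every write also
// lands in a "staging" buffer, a nightly job folds the buffer into the
// aggregate store, then the buffer is truncated. Plain collections
// stand in for the two HBase tables.
public class StagingRollup {
    static final List<Map.Entry<String, Long>> staging = new ArrayList<>();
    static final Map<String, Long> aggregates = new HashMap<>();

    // Application write path: record the raw event in staging.
    static void write(String key, long value) {
        staging.add(Map.entry(key, value));
    }

    // The "nightly" job: aggregate only what accumulated since the
    // last run, then truncate the staging table.
    static void nightlyRollup() {
        for (Map.Entry<String, Long> e : staging) {
            aggregates.merge(e.getKey(), e.getValue(), Long::sum);
        }
        staging.clear();
    }

    public static void main(String[] args) {
        write("sensor1", 5);
        write("sensor1", 7);
        nightlyRollup();
        System.out.println(aggregates.get("sensor1")); // prints 12
    }
}
```

The trade-off J-D points out is visible here: between rollup runs, `aggregates` is stale, so a nightly schedule means no quick access to fresh aggregates.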
