What about "partitioning" at the table level? For example, create 12 tables for a given year, one per month. Design the row keys however you like, say using SHA/MD5 hashes. Place transactions in the appropriate monthly table and then run aggregations against that table alone (this assumes you won't get transactions with timestamps more than a month in the past). The idea is to archive the tables for a given year and start fresh the next; that is acceptable in my use case. I am in the process of trying this out, so I don't have any performance numbers or issues yet... experts can comment.
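To make that concrete, here is a rough sketch of what the write path might look like (0.90-style client API; the "txn_yyyyMM" table naming and the MD5 key layout are just assumptions on my part, nothing I've benchmarked):

import java.security.MessageDigest;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MonthlyPartitionedWriter {

    // Pick the table for the transaction's month, e.g. "txn_201111".
    private static String tableFor(long ts) {
        return "txn_" + new SimpleDateFormat("yyyyMM").format(new Date(ts));
    }

    // MD5 of the natural key spreads rows evenly across regions.
    private static byte[] rowKey(String txnId) throws Exception {
        return MessageDigest.getInstance("MD5").digest(Bytes.toBytes(txnId));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long now = System.currentTimeMillis();
        HTable table = new HTable(conf, tableFor(now));
        try {
            // Hypothetical transaction id and column layout, for illustration.
            Put put = new Put(rowKey("txn-00042"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"),
                    Bytes.toBytes("19.99"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}

Archiving a year would then amount to disabling and dropping (or exporting) twelve tables via HBaseAdmin.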
On a further note, having HBase support this natively, i.e. one more level of partitioning above the row key but below a table, could be beneficial for use cases like this one. Comments?

On Wed, Nov 30, 2011 at 11:53 AM, Jean-Daniel Cryans
<[email protected]> wrote:
> Inline.
>
> J-D
>
> On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas
> <[email protected]> wrote:
>> Hello,
>> ...
>>
>> While it is an option to process the entire HBase table, e.g. every
>> night when we go live, it probably isn't an option when data volume
>> grows over the years. So, what options are there for some kind of
>> incremental aggregation of only new data?
>
> Yeah, you don't want to go there.
>
>> - Perhaps using versioning (internal timestamp) might be an option?
>
> I guess you could do rollups and ditch the raw data, if you don't need it.
>
>> - Perhaps having some kind of HBase (daily) staging table which is
>> truncated after aggregating data is an option?
>
> If you do the aggregations nightly then you won't have "access to
> aggregated data very quickly".
>
>> - How could coprocessors help here (at the time of the go-live, they
>> might be available in e.g. Cloudera)?
>
> Coprocessors are more like an internal HBase tool, so don't put all
> your eggs there until you play with them. What you could do is get the
> 0.92.0 RC0 tarball and try them out :)
>
>> Any ideas/comments are appreciated.
>
> Normally data is stored in a way that's not easy to query in a batch
> or analytics mode, so an ETL step is introduced. You'll probably need
> to do the same, as in you could asynchronously stream your data to
> other HBase tables or Hive or Pig via logs or replication and then
> directly insert it into the format it needs to be or stage it for
> later aggregations. If you explore those avenues I'm sure you'll find
> concepts that are very similar to those you listed regarding RDBMS.
>
> You could also keep live counts using atomic increments; you'd issue
> those at write time or async.
>
> Hope this helps,
>
> J-D
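P.S. On J-D's last point about atomic increments, here is how I understand it could look at write time (the counter table name and column layout are placeholders of my own, not an established schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class LiveCounters {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical counter table with one row per day.
        HTable counters = new HTable(conf, "txn_counters");
        try {
            // Atomically bump the daily transaction count; no
            // read-modify-write needed on the client side.
            counters.incrementColumnValue(
                Bytes.toBytes("2011-11-30"),   // row = day
                Bytes.toBytes("c"),            // column family
                Bytes.toBytes("txn_count"),    // qualifier
                1L);                           // delta
        } finally {
            counters.close();
        }
    }
}

Issued on every write (or batched asynchronously, as J-D says), this keeps the aggregates current without any nightly scan.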
