What about "partitioning" at the table level? For example, create 12 tables for a given year, one per month. Design the row keys however you like, say using SHA/MD5 hashes. Place transactions in the appropriate monthly table and then run aggregations against that table alone (this assumes you won't get transactions with timestamps more than a month in the past). The idea is to archive the tables for a given year and start fresh the next; that is acceptable in my use case. I am in the process of trying this out, so I don't have any performance numbers or issues yet... experts can comment.
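To make that concrete, here is a rough sketch of what the write path might look like (0.90-style client API; the "txn_yyyyMM" table naming and the MD5 key layout are just assumptions on my part, nothing I've benchmarked):

import java.security.MessageDigest;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MonthlyPartitionedWriter {

    // Pick the table for the transaction's month, e.g. "txn_201111".
    private static String tableFor(long ts) {
        return "txn_" + new SimpleDateFormat("yyyyMM").format(new Date(ts));
    }

    // MD5 of the natural key spreads rows evenly across regions.
    private static byte[] rowKey(String txnId) throws Exception {
        return MessageDigest.getInstance("MD5").digest(Bytes.toBytes(txnId));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long now = System.currentTimeMillis();
        HTable table = new HTable(conf, tableFor(now));
        try {
            // Hypothetical transaction id and column layout, for illustration.
            Put put = new Put(rowKey("txn-00042"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"),
                    Bytes.toBytes("19.99"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}

Archiving a year would then amount to disabling and dropping (or exporting) twelve tables via HBaseAdmin.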
On a further note, having HBase support this natively, i.e. one more level of partitioning above the row key but below a table, could be beneficial for use cases like this one. Comments?

On Wed, Nov 30, 2011 at 11:53 AM, Jean-Daniel Cryans
<[email protected]> wrote:
> Inline.
>
> J-D
>
> On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas
> <[email protected]> wrote:
>> Hello,
>> ...
>>
>> While it is an option to process the entire HBase table, e.g. every
>> night when we go live, it probably isn't an option when data volume
>> grows over the years. So, what options are there for some kind of
>> incremental aggregation of only new data?
>
> Yeah, you don't want to go there.
>
>> - Perhaps using versioning (internal timestamp) might be an option?
>
> I guess you could do rollups and ditch the raw data, if you don't need it.
>
>> - Perhaps having some kind of HBase (daily) staging table which is
>> truncated after aggregating data is an option?
>
> If you do the aggregations nightly then you won't have "access to
> aggregated data very quickly".
>
>> - How could coprocessors help here (at the time of the go-live, they
>> might be available in e.g. Cloudera)?
>
> Coprocessors are more like an internal HBase tool, so don't put all
> your eggs there until you play with them. What you could do is get the
> 0.92.0 RC0 tarball and try them out :)
>
>> Any ideas/comments are appreciated.
>
> Normally data is stored in a way that's not easy to query in a batch
> or analytics mode, so an ETL step is introduced. You'll probably need
> to do the same, as in you could asynchronously stream your data to
> other HBase tables or Hive or Pig via logs or replication and then
> directly insert it into the format it needs to be or stage it for
> later aggregations. If you explore those avenues I'm sure you'll find
> concepts that are very similar to those you listed regarding RDBMS.
>
> You could also keep live counts using atomic increments; you'd issue
> those at write time or async.
>
> Hope this helps,
>
> J-D
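P.S. On J-D's last point about atomic increments, here is how I understand it could look at write time (the counter table name and column layout are placeholders of my own, not an established schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class LiveCounters {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical counter table with one row per day.
        HTable counters = new HTable(conf, "txn_counters");
        try {
            // Atomically bump the daily transaction count; no
            // read-modify-write needed on the client side.
            counters.incrementColumnValue(
                Bytes.toBytes("2011-11-30"),   // row = day
                Bytes.toBytes("c"),            // column family
                Bytes.toBytes("txn_count"),    // qualifier
                1L);                           // delta
        } finally {
            counters.close();
        }
    }
}

Issued on every write (or batched asynchronously, as J-D says), this keeps the aggregates current without any nightly scan.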
