Inline.

J-D
On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas <[email protected]> wrote:
> Hello,
> ...
>
> While it is an option to process the entire HBase table, e.g. every night
> once we go live, it probably isn't an option as data volume grows over
> the years. So, what options are there for some kind of incremental
> aggregation of only new data?

Yeah, you don't want to go there.

> - Perhaps using versioning (internal timestamp) might be an option?

I guess you could do rollups and ditch the raw data, if you don't need it.

> - Perhaps having some kind of (daily) HBase staging table which is
> truncated after aggregating data is an option?

If you do the aggregations nightly, then you won't have "access to
aggregated data very quickly".

> - How could coprocessors help here (at the time of the go-live, they
> might be available in e.g. Cloudera)?

Coprocessors are more of an internal HBase tool, so don't put all your
eggs there until you've played with them. What you could do is get the
0.92.0 RC0 tarball and try them out :)

> Any ideas/comments are appreciated.

Normally data is stored in a way that's not easy to query in a batch or
analytics mode, so an ETL step is introduced. You'll probably need to do
the same: you could asynchronously stream your data to other HBase
tables, or to Hive or Pig, via logs or replication, and then either
insert it directly in the format it needs to be in or stage it for later
aggregations. If you explore those avenues, I'm sure you'll find
concepts very similar to the RDBMS ones you listed.

You could also keep live counts using atomic increments; you'd issue
those at write time or asynchronously.

Hope this helps,

J-D
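The staging-table idea discussed above can be sketched in plain Java. This is only an illustration of the pattern, not HBase code: in a real setup the staging buffer and the aggregate store would each be an HBase table, and the nightly job would scan one and write the other. All names here are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the daily staging pattern: every write also
// lands in a "staging" buffer, a nightly job folds the buffer into the
// aggregate store, then the buffer is truncated. Plain collections
// stand in for the two HBase tables.
public class StagingRollup {
    static final List<Map.Entry<String, Long>> staging = new ArrayList<>();
    static final Map<String, Long> aggregates = new HashMap<>();

    // Application write path: record the raw event in staging.
    static void write(String key, long value) {
        staging.add(Map.entry(key, value));
    }

    // The "nightly" job: aggregate only what accumulated since the
    // last run, then truncate the staging table.
    static void nightlyRollup() {
        for (Map.Entry<String, Long> e : staging) {
            aggregates.merge(e.getKey(), e.getValue(), Long::sum);
        }
        staging.clear();
    }

    public static void main(String[] args) {
        write("sensor1", 5);
        write("sensor1", 7);
        nightlyRollup();
        System.out.println(aggregates.get("sensor1")); // prints 12
    }
}
```

The trade-off J-D points out is visible here: between rollup runs, `aggregates` is stale, so a nightly schedule means no quick access to fresh aggregates.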
