The data will be accessed both by MR jobs (if possible, via Hive, using HBaseStorageHandler), and randomly via the REST API. The rows won't be too big.
Ideally, I would like to store lists of attributes for every row key (for example, lists of visitors to a set of URLs, with the URL as the row key). Thus, one option is an insertion scheme where, for every row key, new data are appended to the existing list. This makes retrieval straightforward.

The second option is to store new data in separate rows by making the timestamp part of the row key, and to scan through a set of rows on retrieval. This makes insertions easy, but would row scans be fast enough for random access via the REST API?

A third option is to store new data in a different column, i.e. making the timestamp the column qualifier. I'm not sure what drawbacks that entails...

Retrieving data that have been accumulating over time seems like a pretty common use pattern; I'm a little surprised that I couldn't easily find guidelines or descriptions of the possible trade-offs...

--Leo

On Mon, Nov 1, 2010 at 7:17 AM, Michael Segel <[email protected]> wrote:
>
> Best? That's pretty subjective.
>
> How are you planning on accessing the data?
> Since you don't want to overwrite the data you can't really rely on the
> timestamps.
> (Or is the updated data a replacement?)
>
> Depending on the data size and structure, you could append to the same
> column family and column (record), or you could create a new column and
> insert the data there.
>
> Not sure which would be best; it would depend on how you want to access
> the data.
>
>> Date: Mon, 1 Nov 2010 02:28:31 -0700
>> Subject: Best strategy for row updates
>> From: [email protected]
>> To: [email protected]
>>
>> We are populating some HBase tables from daily data streams that are
>> stored in Hive. When we see a row key that's already in the table,
>> the data should be appended to that row's record. What is the best
>> way to achieve this?.. Should we be using the Java API?.. Rely on
>> HBase cell timestamping?.. Create compound keys (row_id+date) and
>> periodically run a separate MR job to coalesce all the data belonging
>> to the same row_id?..
>>
>> Any pointers greatly appreciated!
>>
>> --Leo
>
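P.S. In case it helps the discussion, here is a minimal sketch of how option 2's compound row keys might be built. The class and method names are my own for illustration, not anything from the HBase API; it assumes a millisecond timestamp. Reversing the timestamp makes the newest entries sort first, so a prefix scan can pick up recent data without walking the whole key range.

```java
/**
 * Sketch of option 2: a compound row key of row_id + timestamp.
 * Class and method names are illustrative, not from the HBase API.
 */
public class CompoundKey {

    /**
     * Builds "rowId_&lt;reversed timestamp&gt;". Reversing the timestamp
     * (Long.MAX_VALUE - ts) makes newer entries sort lexicographically
     * before older ones.
     */
    static String makeRowKey(String rowId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Zero-pad to a fixed width so lexicographic order matches numeric order.
        return String.format("%s_%019d", rowId, reversed);
    }

    public static void main(String[] args) {
        String older = makeRowKey("http://example.com/page", 1_000L);
        String newer = makeRowKey("http://example.com/page", 2_000L);
        // The later event sorts before the earlier one (newest-first scans).
        System.out.println(newer.compareTo(older) < 0); // prints "true"
    }
}
```

Retrieval would then presumably be a Scan restricted to the `rowId + "_"` prefix, which is where my question about REST-API scan latency comes in.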
