The data will be accessed both by MR jobs (if possible, via Hive, using HBaseStorageHandler), and randomly via the REST API. The rows won't be too big.
Ideally, I would like to store lists of attributes for every row key (for example, lists of visitors to a set of URLs, with the URL as the row key). Thus, one option is an insertion scheme where, for every row key, new data are appended to the existing list. This makes retrieval straightforward.

The second option is to store new data in separate rows by making the timestamp part of the row key, and to scan through a set of rows on retrieval. This makes insertions easy, but would row scans be fast enough for random access via the REST API?

A third option is to store new data in a different column, i.e. making the timestamp the column qualifier. I'm not sure what drawbacks that entails...

Retrieving data that have been accumulating over time seems like a pretty common use pattern; I'm a little surprised that I couldn't easily find guidelines or descriptions of the possible trade-offs...

--Leo

On Mon, Nov 1, 2010 at 7:17 AM, Michael Segel <[email protected]> wrote:
>
> Best? That's pretty subjective.
>
> How are you planning on accessing the data?
> Since you don't want to overwrite the data you can't really rely on the
> timestamps.
> (Or is the updated data a replacement?)
>
> Depending on the data size and structure, you could append to the same
> column family and column (record), or you could create a new column and
> insert the data there.
>
> Not sure which would be best; it would depend on how you want to access
> the data.
>
>> Date: Mon, 1 Nov 2010 02:28:31 -0700
>> Subject: Best strategy for row updates
>> From: [email protected]
>> To: [email protected]
>>
>> We are populating some HBase tables from daily data streams that are
>> stored in Hive. When we see a row key that's already in the table,
>> the data should be appended to that row's record. What is the best
>> way to achieve this?.. Should we be using the Java API?.. Rely on
>> HBase cell timestamping?.. Create compound keys (row_id+date) and
>> periodically run a separate MR job to coalesce all the data belonging
>> to the same row_id?..
>>
>> Any pointers greatly appreciated!
>>
>> --Leo
>
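P.S. In case it helps the discussion, here is a minimal sketch of how option 2's compound row keys might be built. The class and method names are my own for illustration, not anything from the HBase API; it assumes a millisecond timestamp. Reversing the timestamp makes the newest entries sort first, so a prefix scan can pick up recent data without walking the whole key range.

```java
/**
 * Sketch of option 2: a compound row key of row_id + timestamp.
 * Class and method names are illustrative, not from the HBase API.
 */
public class CompoundKey {

    /**
     * Builds "rowId_&lt;reversed timestamp&gt;". Reversing the timestamp
     * (Long.MAX_VALUE - ts) makes newer entries sort lexicographically
     * before older ones.
     */
    static String makeRowKey(String rowId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Zero-pad to a fixed width so lexicographic order matches numeric order.
        return String.format("%s_%019d", rowId, reversed);
    }

    public static void main(String[] args) {
        String older = makeRowKey("http://example.com/page", 1_000L);
        String newer = makeRowKey("http://example.com/page", 2_000L);
        // The later event sorts before the earlier one (newest-first scans).
        System.out.println(newer.compareTo(older) < 0); // prints "true"
    }
}
```

Retrieval would then presumably be a Scan restricted to the `rowId + "_"` prefix, which is where my question about REST-API scan latency comes in.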
