Hi,

I apologise if this has been asked a million times, but after some searching
I'm still not sure if this is a good idea. I've got my local (currently
standalone) server running, Thrift bindings etc., and have started playing
with schemas.

I'd like to store a large amount of numeric time-series data using
HBase. The data can be visualised as a 2d array.

The row axis is a timestamp (YYYYMMDD_Milliseconds), with between 1 and
100 million rows per day.
The column axis is a numeric identifier (about 20 000 unique ids).
Each cell of this array is a small number of values representing some
information for this identifier at this timestamp.
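
To make the row axis concrete, here is a rough Python sketch of how I build
the keys. It assumes the Milliseconds part means milliseconds since midnight
UTC, zero-padded to 8 digits so that keys sort chronologically; the name
make_row_key is purely illustrative.

```python
from datetime import datetime, timezone


def make_row_key(ts: datetime) -> str:
    """Build a YYYYMMDD_Milliseconds key (assumed: ms since midnight UTC)."""
    ts = ts.astimezone(timezone.utc)
    seconds_of_day = (ts.hour * 60 + ts.minute) * 60 + ts.second
    ms_of_day = seconds_of_day * 1000 + ts.microsecond // 1000
    # The largest value is 86_399_999, so 8 digits of zero padding make
    # lexicographic order match chronological order.
    return f"{ts:%Y%m%d}_{ms_of_day:08d}"
```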

The array is very sparse: some identifiers will only have one entry per day,
some will have millions. I thought HBase might be a good fit due to the
scaling (I've got many terabytes of data to store) and the built-in
versioning of cells. Occasionally I need to overwrite previous cell values,
but I must always keep a complete history of previous values to produce
'point-in-time' views of the dataset.
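
By a 'point-in-time' view I mean something like the following toy model. It
is plain Python, not HBase code: VersionedCell and its methods are
hypothetical names, and the integer write times stand in for whatever
version timestamps the store would keep.

```python
from bisect import bisect_right


class VersionedCell:
    """Toy cell that keeps every value ever written, ordered by write time."""

    def __init__(self):
        self._write_times = []  # kept sorted ascending
        self._values = []       # _values[i] was written at _write_times[i]

    def put(self, write_time, value):
        # Insert in sorted position so late-arriving writes stay ordered.
        i = bisect_right(self._write_times, write_time)
        self._write_times.insert(i, write_time)
        self._values.insert(i, value)

    def get(self, as_of=None):
        """Return the latest value, or the value visible at time as_of."""
        if as_of is None:
            return self._values[-1] if self._values else None
        i = bisect_right(self._write_times, as_of)
        return self._values[i - 1] if i else None
```

An overwrite is then just another put with a later write time, and
get(as_of=t) reproduces what the cell looked like at time t.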

My first HBase schema was along the lines of having a row per timestamp
(YYYYMMDD_Milliseconds) containing a column family for the identifiers, with
values stored in there.

This gives me nice and fast lookup by timestamp, but does not work at all
for looking up all values for a specific identifier over all times. Going
back to the 2d array description, I need to be able to slice along rows
(timestamps) or columns (identifiers).
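
Here is a toy Python model of my first schema and the two slices I need. It
is only a sketch of the access pattern, not real HBase or Thrift code: the
dict stands in for the table, the column family name "id" is made up, and
the function names are hypothetical.

```python
from collections import defaultdict

# Toy stand-in for the table: row key -> {"id:<identifier>" -> value}
table = defaultdict(dict)


def put(row_key, identifier, value):
    table[row_key][f"id:{identifier}"] = value


def slice_by_time(start_key, stop_key):
    """Row slice over keys in [start_key, stop_key).

    With row keys stored in sorted order this would be one contiguous
    range; the toy version just filters the sorted keys.
    """
    return {k: table[k] for k in sorted(table) if start_key <= k < stop_key}


def slice_by_identifier(identifier):
    """Column slice: every row has to be inspected for the one column."""
    col = f"id:{identifier}"
    return {k: row[col] for k, row in sorted(table.items()) if col in row}
```

The row slice only depends on a key range, while the column slice has no way
to avoid walking the whole table, which is the part that does not work for me.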

Any tips as to how to achieve something like this using HBase? Am I using the
wrong tool for the job? Am I completely misunderstanding how this all
works?

Thanks,
  Brian
