Hi,

I apologise if this has been asked a million times, but after some searching I'm still not sure if this is a good idea. I've got my local (currently standalone) server running, Thrift bindings etc. set up, and have started playing with schemas.
I'd like to store a large amount of numeric time-series data using HBase. The data can be visualised as a 2D array:

Row axis: timestamp (YYYYMMDD_Milliseconds), between 1 and 100 million rows per day.
Column axis: numeric identifier, about 20 000 unique ids.

Each cell of this array is a small number of values representing some information for this identifier at this timestamp. The array is very sparse: some identifiers will only have one entry per day, some will have millions.

I thought HBase might be a good fit due to the scaling (I've got many terabytes of data to store) and the built-in versioning of cells. Occasionally I need to overwrite previous cell values, but I must always keep a complete history of previous values so that I can produce 'point-in-time' views of the dataset.

My first HBase schema was along the lines of having a row per timestamp (YYYYMMDD_Milliseconds) containing a column family for the identifiers, with the values stored in there. This gives me nice and fast lookup by timestamp, but does not work at all for looking up all values for a specific identifier over all times.

Going back to the 2D array description, I need to be able to slice along rows (timestamps) or columns (identifiers).

Any tips as to how to achieve something like this using HBase? Am I using the wrong tool for the job? Am I completely misunderstanding how this all works?

Thanks,
Brian
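
P.S. In case the description is unclear, here is a toy sketch in Python of the timestamp-per-row layout and the two slices I need. It uses plain dicts rather than any HBase or Thrift calls. The zero-padded row keys, the "id:" column names, the sample values and both function names are made up purely for illustration.

```python
# Toy in-memory model of the timestamp-per-row layout (plain dicts, NOT the HBase API).
# Row key: "YYYYMMDD_Milliseconds"; one column per identifier in a family called "id".
# All keys and values below are invented sample data.
table = {
    "20100301_00000100": {"id:17": "1.5", "id:42": "2.0"},
    "20100301_00000250": {"id:42": "2.1"},
    "20100302_00000100": {"id:17": "1.7"},
}


def slice_by_timestamp(table, row_key):
    """Row slice: a single keyed lookup, which is why this direction is fast."""
    return table.get(row_key, {})


def slice_by_identifier(table, column):
    """Column slice: must visit every row, because rows are keyed by time only."""
    return {
        row_key: cells[column]
        for row_key, cells in sorted(table.items())
        if column in cells
    }


print(slice_by_timestamp(table, "20100301_00000100"))
print(slice_by_identifier(table, "id:42"))
```

The second function is the problem: with up to 100 million rows per day, touching every row to pull out one identifier's history is not workable.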
