Hi, I'm currently experimenting with exactly that: storing the data twice, once in a per-row table (keyed by timestamp) and once in a per-column table (keyed by identifier). The downside I see is that I can no longer do the write as a single transaction, since it is now split across two tables (unless I'm misunderstanding HBase transactions!).
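To make the double write concrete, here is a rough sketch of what I mean. It uses plain Python dicts as stand-ins for the two tables rather than the real Thrift client, so `FakeTable`, `dual_write`, the `id:`/`ts:` column prefixes and the sample keys are all made up for illustration. As far as I understand, HBase only guarantees atomicity for a mutation within a single row, which is why the two puts below cannot be one transaction.

```python
from collections import defaultdict


class FakeTable:
    """In-memory stand-in for an HBase table: row key -> {column: value}.

    Hypothetical helper for illustration only, not a real HBase client.
    """

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, columns):
        # One put touches exactly one row, mirroring single-row atomicity.
        self.rows[row_key].update(columns)

    def row(self, row_key):
        # Return a copy of the row's cells (empty dict if the row is absent).
        return dict(self.rows.get(row_key, {}))


def dual_write(by_time, by_id, timestamp, identifier, value):
    """Store one cell in both tables using two independent puts."""
    # Table 1: row key = timestamp, one column per identifier.
    by_time.put(timestamp, {f"id:{identifier}": value})
    # A crash at this point would leave the two tables out of sync.
    # Table 2: row key = identifier, one column per timestamp.
    by_id.put(identifier, {f"ts:{timestamp}": value})


by_time = FakeTable()
by_id = FakeTable()

dual_write(by_time, by_id, "20101029_00001234", "4711", b"1.5")
dual_write(by_time, by_id, "20101029_00005678", "4711", b"1.7")
dual_write(by_time, by_id, "20101029_00005678", "42", b"9.0")

# Slice along a row of the 2d array: everything at one timestamp.
slice_by_time = by_time.row("20101029_00005678")
# Slice along a column of the 2d array: everything for one identifier.
slice_by_id = by_id.row("4711")
```

One option I'm considering is to treat the per-timestamp table as the source of truth and the per-identifier table as something that can be rebuilt from it, so a failed second put only costs a repair pass rather than lost data.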

Thanks,
Brian

On 29 October 2010 15:39, Erik Holstad <[email protected]> wrote:

> Hey Brian!
> One thing that you could do to accommodate the second query is to do
> another write.
> Either setting up an index that points you back to the original row, or
> just putting the data in there a second time with the specific identifier
> as the row key and then the timestamps as column.
>
> Erik
>
> On Fri, Oct 29, 2010 at 2:10 AM, Brian O'Kennedy <[email protected]> wrote:
>
> > Hi,
> >
> > I apologise if this has been asked a million times, but after some
> > searching I'm still not sure if this is a good idea. I've got my local
> > (currently standalone) server running, Thrift bindings etc and have
> > started playing with schemas.
> >
> > I'd like to store a large amount of numeric time-series data using
> > HBase. The data can be visualised as a 2d array.
> >
> > Row-axis is timestamp (YYYYMMDD_Milliseconds) (between 1 and 100
> > million rows per day)
> > Column axis is a numeric identifier (in the range of about 20 000
> > unique ids)
> > Each cell of this array is a small number of values representing some
> > information for this identifier at this timestamp.
> >
> > The array is very sparse, some identifiers will only have one entry per
> > day, some will have millions. I thought HBase might be a good fit due
> > to the scaling (I've got many terabytes of data to store) and the
> > built-in versioning of cells. Occasionally I need to overwrite previous
> > cell values, but always keep a complete history of previous values to
> > produce 'point-in-time' views of the dataset.
> >
> > My first HBase schema was along the lines of having an row per
> > timestamp: YYYYMMDD_Milliseconds containing a column family for the
> > identifiers, with values stored in there.
> >
> > This gives me nice and fast lookup by timestamp, but does not work at
> > all for looking up all values for a specific identifier over all times.
> > Going back to the 2d array description, I need to be able to slice
> > along rows (timestamps) or columns (identifiers).
> >
> > Any tips as to how achieve something like this using HBase? Am I using
> > the wrong tool for the job? Am I completely misunderstanding how this
> > all works?
> >
> > Thanks,
> > Brian
>
> --
> Regards Erik
