Re: Time-series schema

Brian O'Kennedy Fri, 29 Oct 2010 08:15:34 -0700

Hi,

I'm currently playing with doing just that: storing the data twice, once in
the per-row-table and once in the per-column-table. The downside I see with
this is that I cannot do this as a single transaction any more, since my
transaction is now split across two tables (or I'm misunderstanding HBase
transactions!).
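
A minimal sketch of the double write, using plain Python dicts as stand-ins
for the two tables rather than a real HBase client. The table names, helper
functions and zero-padding widths are all assumptions made for illustration.
As far as I know, HBase only guarantees atomicity for mutations within a
single row, so the two writes below would be two separate operations and a
reader could see one without the other.

```python
from collections import defaultdict

# In-memory stand-ins for the two tables (hypothetical names, not HBase API).
by_time = defaultdict(dict)  # row key: timestamp,  column: identifier
by_id = defaultdict(dict)    # row key: identifier, column: timestamp


def time_key(day: str, millis: int) -> str:
    """Build a YYYYMMDD_Milliseconds key.

    Zero-padding to 8 digits is an assumption so that keys sort
    chronologically as strings (a day has 86,400,000 ms).
    """
    return f"{day}_{millis:08d}"


def id_key(identifier: int) -> str:
    """Zero-pad the identifier (assumed to fit in 5 digits)."""
    return f"{identifier:05d}"


def put_twice(day: str, millis: int, identifier: int, value) -> None:
    """Write the same cell into both layouts. The two writes are independent."""
    ts, ident = time_key(day, millis), id_key(identifier)
    by_time[ts][ident] = value   # per-row table: slice by timestamp
    by_id[ident][ts] = value     # per-column table: slice by identifier


def slice_by_time(day: str, millis: int) -> dict:
    """All identifiers that have a value at one timestamp."""
    return dict(by_time[time_key(day, millis)])


def slice_by_id(identifier: int) -> list:
    """All (timestamp, value) pairs for one identifier, in time order."""
    return sorted(by_id[id_key(identifier)].items())
```

If the second write fails, the two tables disagree until it is retried, so
the writer (or a periodic repair job) has to be able to redo it safely.
Writing the same value to the same row and column again is harmless in this
layout.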


Thanks,
 Brian


On 29 October 2010 15:39, Erik Holstad <[email protected]> wrote:

> Hey Brian!
> One thing that you could do to accommodate the second query is to do
> another write: either set up an index that points you back to the
> original row, or just put the data in there a second time, with the
> specific identifier as the row key and the timestamps as columns.
>
> Erik
>
> On Fri, Oct 29, 2010 at 2:10 AM, Brian O'Kennedy <[email protected]>
> wrote:
>
> > Hi,
> >
> > I apologise if this has been asked a million times, but after some
> > searching I'm still not sure if this is a good idea. I've got my local
> > (currently standalone) server running, Thrift bindings etc and have
> > started playing with schemas.
> >
> > I'd like to store a large amount of numeric time-series data using
> > HBase. The data can be visualised as a 2d array.
> >
> > Row axis is the timestamp (YYYYMMDD_Milliseconds), with between 1 and
> > 100 million rows per day.
> > Column axis is a numeric identifier (in the range of about 20 000
> > unique ids).
> > Each cell of this array is a small number of values representing some
> > information for this identifier at this timestamp.
> >
> > The array is very sparse: some identifiers will only have one entry per
> > day, some will have millions. I thought HBase might be a good fit due to
> > the scaling (I've got many terabytes of data to store) and the built-in
> > versioning of cells. Occasionally I need to overwrite previous cell
> > values, but always keep a complete history of previous values to produce
> > 'point-in-time' views of the dataset.
> >
> > My first HBase schema was along the lines of having a row per timestamp
> > (YYYYMMDD_Milliseconds) containing a column family for the identifiers,
> > with values stored in there.
> >
> > This gives me nice and fast lookup by timestamp, but does not work at all
> > for looking up all values for a specific identifier over all times.
> > Going back to the 2d array description, I need to be able to slice along
> > rows (timestamps) or columns (identifiers).
> >
> > Any tips as to how to achieve something like this using HBase? Am I
> > using the wrong tool for the job? Am I completely misunderstanding how
> > this all works?
> >
> > Thanks,
> >   Brian
> >
>
>
>
> --
> Regards Erik
>
