RE: Time-series schema

Michael Segel Fri, 29 Oct 2010 17:50:51 -0700

Brian,

That's exactly my point.
In both examples, the row key works really well for one access path but not 
for the other.


So when considering your schema, you have to consider your predominant path of 
access.
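
To make the trade-off concrete, here is a rough sketch in plain Java. It 
uses a TreeMap as a stand-in for a table sorted by row key; no HBase client 
is involved. The "symbol|timestamp" key format, the helper names and the 
sample values are all made up for illustration, and the timestamp is assumed 
to be zero-padded so that it sorts lexicographically.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class KeyLayouts {
    // Hypothetical key formats; separator and field order are assumptions.
    static String timeFirstKey(String ts, String symbol) {
        return ts + "|" + symbol;
    }

    static String symbolFirstKey(String symbol, String ts) {
        return symbol + "|" + ts;
    }

    // Range scan over a sorted map: every key that starts with the prefix.
    static SortedMap<String, String> prefixScan(TreeMap<String, String> table,
                                                String prefix) {
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeMap<String, String> bySymbol = new TreeMap<>();
        // Made-up sample values in a "volume@price" style.
        bySymbol.put(symbolFirstKey("IBM", "20101029_34200000"), "100@143.10");
        bySymbol.put(symbolFirstKey("IBM", "20101029_57600000"), "200@143.60");
        bySymbol.put(symbolFirstKey("GE", "20101029_34200000"), "500@16.05");

        // With symbol-first keys, all IBM rows form one contiguous range.
        for (Map.Entry<String, String> e : prefixScan(bySymbol, "IBM|").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

If the same rows were stored under timeFirstKey(), a prefix scan on a 
timestamp would be the cheap one, and collecting all IBM rows would mean 
walking the whole table.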

To keep it simple you could just limit yourself to one exchange, like NYSE. 

I saw in your other post that you thought about storing the data in two 
different tables in two different formats. For narrow sets of data, it's 
possible; however, you then run into the issue of which one is really the 
system of record. (You have two tables representing the same data set, and if 
one gets updated and falls out of sync with the other, which one is your 
authoritative record?) Again, you have to decide, and you can always write a 
script to convert your authoritative records into your other table's format.
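
Such a conversion script could look something like the sketch below. It 
assumes the time-keyed table is the system of record and that keys look like 
"timestamp|symbol"; both are assumptions for illustration only, and the 
tables are again modelled as in-memory sorted maps rather than real HBase 
tables.

```java
import java.util.Map;
import java.util.TreeMap;

public class RebuildDerived {
    // Rebuilds the symbol-keyed table from the authoritative time-keyed one.
    // Assumed key formats (hypothetical): "<timestamp>|<symbol>" in the
    // source, "<symbol>|<timestamp>" in the result.
    static TreeMap<String, String> rebuildBySymbol(Map<String, String> byTime) {
        TreeMap<String, String> bySymbol = new TreeMap<>();
        for (Map.Entry<String, String> e : byTime.entrySet()) {
            String[] parts = e.getKey().split("\\|", 2); // [timestamp, symbol]
            bySymbol.put(parts[1] + "|" + parts[0], e.getValue());
        }
        return bySymbol;
    }

    public static void main(String[] args) {
        TreeMap<String, String> byTime = new TreeMap<>();
        // Made-up sample rows.
        byTime.put("20101029_34200000|IBM", "100@143.10");
        byTime.put("20101029_34200000|GE", "500@16.05");
        System.out.println(rebuildBySymbol(byTime));
    }
}
```

Because the second table is always derived from the first, there is never a 
question about which copy wins after they drift apart: you just rebuild.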

With respect to timestamps... funny thing. You can call 
System.currentTimeMillis() once and use that value as your input timestamp, so 
that you write to both tables with the same timestamp.
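
A minimal sketch of that idea, again with in-memory maps standing in for the 
two tables. The class, the Cell type and the key formats are hypothetical; 
with the real client you would pass the value as the explicit timestamp on 
each Put.

```java
import java.util.TreeMap;

public class DualWrite {
    // A stored value together with the version timestamp it was written at.
    static final class Cell {
        final String value;
        final long ts;

        Cell(String value, long ts) {
            this.value = value;
            this.ts = ts;
        }
    }

    final TreeMap<String, Cell> byTime = new TreeMap<>();
    final TreeMap<String, Cell> bySymbol = new TreeMap<>();

    // Writes the same value to both layouts under one caller-supplied timestamp.
    void write(String tradeTime, String symbol, String value, long ts) {
        byTime.put(tradeTime + "|" + symbol, new Cell(value, ts));
        bySymbol.put(symbol + "|" + tradeTime, new Cell(value, ts));
    }

    public static void main(String[] args) {
        DualWrite db = new DualWrite();
        long ts = System.currentTimeMillis(); // read the clock once, reuse it
        db.write("20101029_34200000", "IBM", "100@143.10", ts);
        System.out.println(db.byTime.firstEntry().getValue().ts
                == db.bySymbol.firstEntry().getValue().ts);
    }
}
```

Note that sharing the timestamp only makes the two copies comparable. It does 
not make the pair of writes atomic; one of them can still fail on its own.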

-Mike


> From: [email protected]
> Date: Fri, 29 Oct 2010 16:20:31 +0100
> Subject: Re: Time-series schema
> To: [email protected]
> 
> Hi Michael,
> 
> Thanks for the suggestion. To simplify your example slightly, let's say that
> I'm only interested in a single exchange.
> 
> I'd like to be able to (quickly) extract all data (multiple symbols for a
> single timestamp), but I'd also like to (preferably, fairly quickly) extract
> all values over all time for a particular symbol (single symbol, multiple
> timestamps).
> 
> So, from your description below I believe I can come up with a design that
> does ONE of these two queries very well, but the other very badly. Is there
> a way to have the best of both without having to implement both separately?
> 
> And if I do so, do I lose all ability to update this database in an atomic
> fashion (i.e., insert a bunch of new data for some timestamp)?
> 
> Thanks,
>  Brian
> 
> 
> On 29 October 2010 15:53, Michael Segel <[email protected]> wrote:
> 
> >
> > Brian,
> >
> > I think you have to consider how you're going to use the data when you
> > consider your schema.
> >
> > An example...
> > If we're looking at stock market data you could use the timestamp as your
> > key, and then a column family for each exchange, and then a column for a
> > stock where you store the ask/bid, trade(volume)@price as your value.
> >
> > This is one potential time series schema. However, suppose you want to
> > track all of the IBM trades across all exchanges?
> > You're going to have a harder time of getting the data that you want.
> >
> > You may then want to prefix the key with the stock symbol and instead of a
> > column family per exchange, you have a column per exchange.
> > This would tend to co-locate like data. So you can do some range scans.
> >
> > Now either time series schema is valid. But one is going to be more valid
> > if you are looking at data on a per stock basis.
> >
> > Does that make sense?
> >
> >
> > > From: [email protected]
> > > Date: Fri, 29 Oct 2010 10:10:23 +0100
> > > Subject: Time-series schema
> > > To: [email protected]
> > >
> > > Hi,
> > >
> > > > I apologise if this has been asked a million times, but after some
> > > > searching I'm still not sure if this is a good idea. I've got my local
> > > > (currently standalone) server running, Thrift bindings etc and have
> > > > started playing with schemas.
> > >
> > > I'd like to store a large amount of numeric time-series data using
> > > HBase. The data can be visualised as a 2d array.
> > >
> > > Row-axis is timestamp (YYYYMMDD_Milliseconds) (between 1 and 100 million
> > > rows per day)
> > > Column axis is a numeric identifier (in the range of about 20 000 unique
> > > ids)
> > > Each cell of this array is a small number of values representing some
> > > information for this identifier at this timestamp.
> > >
> > > > The array is very sparse, some identifiers will only have one entry per
> > > > day, some will have millions. I thought HBase might be a good fit due to
> > > > the scaling (I've got many terabytes of data to store) and the built-in
> > > > versioning of cells. Occasionally I need to overwrite previous cell
> > > > values, but always keep a complete history of previous values to produce
> > > > 'point-in-time' views of the dataset.
> > >
> > > > My first HBase schema was along the lines of having a row per timestamp
> > > > (YYYYMMDD_Milliseconds) containing a column family for the identifiers,
> > > > with values stored in there.
> > >
> > > > This gives me nice and fast lookup by timestamp, but does not work at all
> > > > for looking up all values for a specific identifier over all times.
> > > > Going back to the 2d array description, I need to be able to slice along
> > > > rows (timestamps) or columns (identifiers).
> > >
> > > > Any tips as to how to achieve something like this using HBase? Am I using
> > > > the wrong tool for the job? Am I completely misunderstanding how this all
> > > > works?
> > >
> > > Thanks,
> > >   Brian
> >
> >
