RE: Time-series schema

Michael Segel Fri, 29 Oct 2010 18:12:10 -0700

Hey Debashis!

You're going to hate this answer... it depends on what you're trying to collect 
with the time series and on the type of time series.
Also, I don't know if I'd recommend the HBase secondary index, since it's not 
really supported as part of the core release and only lives on GitHub.
(Which is really a shame because the concept is really important for a lot of 
use cases.)


Time series is also kind of a weird thing... you can have a continuous time 
series or a discrete time series. (I may not be using the correct terms, it's 
been a while since I've looked at time series... so let me define what I mean.)
A continuous time series captures data at specific, regularly spaced points in 
time. If a value doesn't exist at one of those points, a null is stored. A 
discrete time series captures the data along with a timestamp whenever the data 
occurs. An example would be stock trading. The trades do not occur in a 
regularly set pattern, so you really only want to track the time of the trade 
and then the trade's data.
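
To make the two definitions concrete, here is a minimal Python sketch. The 
function names and the sample values are made up for illustration; nothing 
here is HBase-specific.

```python
from typing import Dict, List, Optional, Tuple


def continuous_series(
    readings: Dict[int, float], start: int, stop: int, step: int
) -> List[Tuple[int, Optional[float]]]:
    """One slot per fixed interval; a missing reading is stored as None."""
    return [(t, readings.get(t)) for t in range(start, stop, step)]


def discrete_series(events: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Only the events that actually happened, each with its own timestamp."""
    return sorted(events)


# Hypothetical sensor sampled every 10 time units; readings at 10 and 30 are missing.
readings = {0: 1.5, 20: 1.7}
print(continuous_series(readings, 0, 40, 10))
# [(0, 1.5), (10, None), (20, 1.7), (30, None)]

# Hypothetical trades arriving at irregular times (epoch milliseconds, price).
trades = [(1288375930123, 140.25), (1288375930007, 140.20)]
print(discrete_series(trades))
```

The continuous form pays for empty slots but gives a fixed grid; the discrete 
form stores nothing when nothing happens.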

Also depending on what you want to capture, your index might be large relative 
to the initial event and using a secondary table might be more efficient in 
data retrieval.
(Again it depends on the use case.)
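
As a rough illustration of the secondary-table idea, the sketch below uses 
plain Python dicts as stand-ins for two tables. This is not the HBase API; the 
table names, put_trade and trades_for_symbol are all hypothetical. The main 
table is keyed by timestamp, and the index table maps a symbol to the main-table 
row keys that contain it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Stand-ins for two tables (assumption: dicts model row key -> row contents).
main_table: Dict[str, Dict[str, str]] = {}               # timestamp -> {symbol: value}
index_table: Dict[str, List[str]] = defaultdict(list)    # symbol -> [timestamp row keys]


def put_trade(ts: str, symbol: str, value: str) -> None:
    """Write the event once, then write only a small pointer to the index."""
    main_table.setdefault(ts, {})[symbol] = value
    if ts not in index_table[symbol]:
        index_table[symbol].append(ts)


def trades_for_symbol(symbol: str) -> List[Tuple[str, str]]:
    """Read the index first, then make a second call per row to the main table."""
    return [(ts, main_table[ts][symbol]) for ts in sorted(index_table.get(symbol, []))]


# Made-up sample data.
put_trade("20101029_0002", "IBM", "100@140.30")
put_trade("20101029_0001", "IBM", "200@140.25")
put_trade("20101029_0001", "MSFT", "500@26.10")

print(trades_for_symbol("IBM"))
```

The index stores only row keys, so the event data is not duplicated, at the 
cost of the extra lookup on reads and an extra write on inserts.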

(Outside of time series, I'm a big fan of indexing. Grab me Monday and I'll 
show you some of the ideas behind what we did on LCMS. )

Using a different example, this one a continuous time series, imagine 
collecting telemetry data. If we were just doing raw data collection, indexing 
doesn't make sense. 
If we were also capturing an image, then you would definitely want to use an 
index because you don't want to store that data twice. 

The point of my example was that there isn't a single 'best' answer and you 
will most likely have to choose one way over another and take the penalty of a 
full table scan for your non-primary access path.
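
The trade-off can be sketched in a few lines of Python. A sorted list of row 
keys stands in for a table ordered by key; the "SYMBOL|timestamp" key format, 
the function names and the sample rows are assumptions made for illustration.

```python
import bisect
from typing import List, Tuple

# Made-up rows, kept sorted by row key the way a key-ordered table would be.
rows: List[Tuple[str, str]] = sorted([
    ("IBM|20101029_0001", "140.25"),
    ("IBM|20101029_0002", "140.30"),
    ("MSFT|20101029_0001", "26.10"),
    ("AAPL|20101029_0002", "305.00"),
])
keys = [k for k, _ in rows]


def prefix_scan(prefix: str) -> List[Tuple[str, str]]:
    """Primary access path: jump to the first matching key, stop at the first miss."""
    start = bisect.bisect_left(keys, prefix)
    out = []
    for key, value in rows[start:]:
        if not key.startswith(prefix):
            break
        out.append((key, value))
    return out


def full_scan_by_timestamp(ts: str) -> List[Tuple[str, str]]:
    """Non-primary access path: every row has to be examined."""
    return [(k, v) for k, v in rows if k.split("|", 1)[1] == ts]


print(prefix_scan("IBM|"))                      # touches only the IBM rows
print(full_scan_by_timestamp("20101029_0001"))  # touches all rows
```

With the symbol as the key prefix, the per-symbol query is a short range scan 
while the per-timestamp query reads the whole table; keying by timestamp first 
would reverse the two.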

HTH

-Mike




> Date: Fri, 29 Oct 2010 15:55:52 -0500
> Subject: Re: Time-series schema
> From: [email protected]
> To: [email protected]
> 
> If I understand your problem correctly, Hbase secondary index is the
> solution for this. But I doubt the stability of Hbase secondary index as you
> may encounter some runtime exception in case of frequent update of an index
> row.
> 
> Having 2 tables is a good idea to me, as long as your transactional
> application can update both tables to maintain data integrity. So
> the penalty in this case is that in one transaction you are doing double work.
> If you have a batch process, you do that in two separate M/R jobs. You don't
> have to totally duplicate the data in the second table if you are concerned
> about space; you can keep this table just for symbols (as the rowkey of the
> 2nd table) and timestamps (the rowkey of the original table). But this way,
> when retrieving the data, you need a 2nd call to the main table to get the
> rest of the data. This is sort of an HBase secondary indexing approach, but
> you have more control and less overhead.
> 
> If you totally duplicate the data:
> 1. you have less I/O getting the info
> 2. more I/O inserting data
> 
> If you create the lookup table described above:
> 1. you have more I/O getting the data
> 2. less I/O inserting the data
> 
> 
> 
> On Fri, Oct 29, 2010 at 10:20 AM, Brian O'Kennedy <[email protected]> wrote:
> 
> > Hi Michael,
> >
> > Thanks for the suggestion. To simplify your example slightly, let's say that
> > I'm only interested in a single exchange.
> >
> > I'd like to be able to (quickly) extract all data (multiple symbols for a
> > single timestamp), but I'd also like to (preferably, fairly quickly)
> > extract
> > all values over all time for a particular symbol (single symbol, multiple
> > timestamps).
> >
> > So, from your description below I believe I can come up with a design that
> > does ONE of these two queries very well, but the other very badly. Is there
> > a way to have the best of both without having to implement both separately?
> >
> > And if I do so, do I lose all ability to update this database in an atomic
> > fashion? (ie, insert a bunch of new data for some timestamp)
> >
> > Thanks,
> >  Brian
> >
> >
> > On 29 October 2010 15:53, Michael Segel <[email protected]> wrote:
> >
> > >
> > > Brian,
> > >
> > > I think you have to consider how you're going to use the data when you
> > > consider your schema.
> > >
> > > An example...
> > > If we're looking at stock market data you could use the timestamp as your
> > > key, and then a column family for each exchange, and then a column for a
> > > stock where you store the ask/bid, trade(volume)@price as your value.
> > >
> > > This is one potential time series schema. However, suppose you want to
> > > track all of the IBM trades across all exchanges?
> > > You're going to have a harder time of getting the data that you want.
> > >
> > > You may then want to prefix the key with the stock symbol and instead of
> > a
> > > column family per exchange, you have a column per exchange.
> > > This would tend to co-locate like data. So you can do some range scans.
> > >
> > > Now either time series schema is valid. But one is going to be more valid
> > > if you are looking at data on a per stock basis.
> > >
> > > Does that make sense?
> > >
> > >
> > > > From: [email protected]
> > > > Date: Fri, 29 Oct 2010 10:10:23 +0100
> > > > Subject: Time-series schema
> > > > To: [email protected]
> > > >
> > > > Hi,
> > > >
> > > > I apologise if this has been asked a million times, but after some
> > > searching
> > > > I'm still not sure if this is a good idea. I've got my local (currently
> > > > standalone) server running, Thrift bindings etc and have started
> > playing
> > > > with schemas.
> > > >
> > > > I'd like to store a large amount of numeric time-series data using
> > > > HBase. The data can be visualised as a 2d array.
> > > >
> > > > Row-axis is timestamp (YYYYMMDD_Milliseconds) (between 1 and 100
> > million
> > > > rows per day)
> > > > Column axis is a numeric identifier (in the range of about 20 000
> > unique
> > > > ids)
> > > > Each cell of this array is a small number of values representing some
> > > > information for this identifier at this timestamp.
> > > >
> > > > The array is very sparse, some identifiers will only have one entry per
> > > day,
> > > > some will have millions. I thought HBase might be a  good fit due to
> > the
> > > > scaling (I've got many terabytes of data to store) and the built-in
> > > > versioning of cells. Occasionally I need to overwrite previous cell
> > > values,
> > > > but always keep a complete history of previous values to produce
> > > > 'point-in-time' views of the dataset.
> > > >
> > > > My first HBase schema was along the lines of having a row per
> > timestamp:
> > > >  YYYYMMDD_Milliseconds containing a column family for the identifiers,
> > > with
> > > > values stored in there.
> > > >
> > > > This gives me nice and fast lookup by timestamp, but does not work at
> > all
> > > > for looking up all values for a specific  identifier over all times.
> > > Going
> > > > back to the 2d array description, I need to be able to slice along rows
> > > > (timestamps) or columns (identifiers).
> > > >
> > > > Any tips as to how to achieve something like this using HBase? Am I using
> > > the
> > > > wrong tool for the job? Am I completely misunderstanding how this all
> > > > works?
> > > >
> > > > Thanks,
> > > >   Brian
> > >
> > >
> >
> 
> 
> 
> -- 
> - DEBASHIS SAHA
> 
> 2519 Honeysuckle Ln
> Rolling Meadows, IL 60008, USA
> 
> 1-(847) 925 - 5071 (H);
> 1-(312)-731- 6414 (M)
> --~<O>~--
