If I understand your problem correctly, an HBase secondary index is the solution for this. But I doubt the stability of HBase secondary indexing, as you may encounter runtime exceptions when an index row is updated frequently.
Having two tables is a good idea to me, as long as your transactional application can update both tables to keep the data integrity. The penalty in this case is that within one transaction you are doing double the work. If you have a batch process, you do that in two separate M/R jobs.

You don't have to totally duplicate the data in the second table if you are concerned about space; you can keep this table just for the symbols (as the rowkey of the 2nd table) and the timestamps (the rowkey of the original table). But this way, while retrieving the data, you have to make a 2nd call to the main table to get the rest of the data. This is sort of an HBase secondary-indexing approach, but you have more control and less overhead.

If you totally duplicate the data:
1. you have less I/O getting the info
2. more I/O inserting the data

If you create the lookup table described above:
1. you have more I/O getting the data
2. less I/O inserting the data

On Fri, Oct 29, 2010 at 10:20 AM, Brian O'Kennedy <[email protected]> wrote:
> Hi Michael,
>
> Thanks for the suggestion. To simplify your example slightly, let's say that
> I'm only interested in a single exchange.
>
> I'd like to be able to (quickly) extract all data (multiple symbols for a
> single timestamp), but I'd also like to (preferably, fairly quickly) extract
> all values over all time for a particular symbol (single symbol, multiple
> timestamps).
>
> So, from your description below I believe I can come up with a design that
> does ONE of these two queries very well, but the other very badly. Is there
> a way to have the best of both without having to implement both separately?
>
> And if I do so, do I lose all ability to update this database in an atomic
> fashion? (i.e., insert a bunch of new data for some timestamp)
>
> Thanks,
> Brian
>
>
> On 29 October 2010 15:53, Michael Segel <[email protected]> wrote:
> >
> > Brian,
> >
> > I think you have to consider how you're going to use the data when you
> > consider your schema.
> >
> > An example...
> > If we're looking at stock market data, you could use the timestamp as your
> > key, then a column family for each exchange, and then a column for a
> > stock where you store the ask/bid and trade(volume)@price as your value.
> >
> > This is one potential time-series schema. However, suppose you want to
> > track all of the IBM trades across all exchanges?
> > You're going to have a harder time getting the data that you want.
> >
> > You may then want to prefix the key with the stock symbol and, instead of
> > a column family per exchange, have a column per exchange.
> > This would tend to co-locate like data, so you can do some range scans.
> >
> > Now, either time-series schema is valid. But one is going to be more valid
> > if you are looking at data on a per-stock basis.
> >
> > Does that make sense?
> >
> >
> > > From: [email protected]
> > > Date: Fri, 29 Oct 2010 10:10:23 +0100
> > > Subject: Time-series schema
> > > To: [email protected]
> > >
> > > Hi,
> > >
> > > I apologise if this has been asked a million times, but after some
> > > searching I'm still not sure if this is a good idea. I've got my local
> > > (currently standalone) server running, Thrift bindings etc. and have
> > > started playing with schemas.
> > >
> > > I'd like to store a large amount of numeric time-series data using
> > > HBase. The data can be visualised as a 2D array.
> > >
> > > The row axis is timestamp (YYYYMMDD_Milliseconds) (between 1 and 100
> > > million rows per day).
> > > The column axis is a numeric identifier (in the range of about 20,000
> > > unique ids).
> > > Each cell of this array is a small number of values representing some
> > > information for this identifier at this timestamp.
> > >
> > > The array is very sparse; some identifiers will only have one entry per
> > > day, some will have millions. I thought HBase might be a good fit due to
> > > the scaling (I've got many terabytes of data to store) and the built-in
> > > versioning of cells.
> > > Occasionally I need to overwrite previous cell values,
> > > but always keep a complete history of previous values to produce
> > > 'point-in-time' views of the dataset.
> > >
> > > My first HBase schema was along the lines of having a row per timestamp:
> > > YYYYMMDD_Milliseconds, containing a column family for the identifiers,
> > > with the values stored in there.
> > >
> > > This gives me nice and fast lookup by timestamp, but does not work at
> > > all for looking up all values for a specific identifier over all times.
> > > Going back to the 2D array description, I need to be able to slice along
> > > rows (timestamps) or columns (identifiers).
> > >
> > > Any tips on how to achieve something like this using HBase? Am I using
> > > the wrong tool for the job? Am I completely misunderstanding how this
> > > all works?
> > >
> > > Thanks,
> > > Brian
> >

-- 
DEBASHIS SAHA
2519 Honeysuckle Ln
Rolling Meadows, IL 60008, USA
1-(847) 925-5071 (H); 1-(312)-731-6414 (M)
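The dual-table design discussed upthread can be sketched in miniature. The following is only an illustrative toy model: plain Python dicts stand in for the two HBase tables (no HBase client API involved), and the key formats and helper names (`put_tick`, `slice_by_symbol`, the zero-padded `YYYYMMDD_ms` keys) are assumptions for the sketch, not anything from the thread. It shows the double-write on insert and the two read paths with their differing I/O costs.

```python
# Toy model of the dual-table design: a "main" table keyed by timestamp
# (one row per timestamp, one column per symbol), plus a "lookup" table
# keyed by symbol-prefixed row keys that point back into the main table.
main_table = {}    # row key "YYYYMMDD_ms" -> {symbol: value}
lookup_table = {}  # row key "SYMBOL_YYYYMMDD_ms" -> main-table row key

def put_tick(day, ms, symbol, value):
    """Double-write: once into the main table, once into the lookup table."""
    ts_key = f"{day}_{ms:09d}"  # zero-pad so keys sort lexicographically
    main_table.setdefault(ts_key, {})[symbol] = value
    lookup_table[f"{symbol}_{ts_key}"] = ts_key

def slice_by_timestamp(day, ms):
    """Fast path: all symbols for one timestamp is a single main-table row."""
    return main_table.get(f"{day}_{ms:09d}", {})

def slice_by_symbol(symbol):
    """Index path: prefix scan over the lookup table, then a second fetch
    from the main table for each hit (the extra read I/O described above)."""
    prefix = symbol + "_"
    out = []
    for key in sorted(lookup_table):  # HBase scans are key-ordered; dicts need sorting
        if key.startswith(prefix):
            ts_key = lookup_table[key]
            out.append((ts_key, main_table[ts_key][symbol]))
    return out
```

For example, after `put_tick("20101029", 1000, "IBM", 145.2)` and a few more ticks, `slice_by_timestamp` answers the per-timestamp query from one row, while `slice_by_symbol("IBM")` does a prefix scan plus a lookup per match; fully duplicating the values into the second table would remove that second lookup at the cost of extra write I/O, exactly the trade-off listed above.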
