On Fri, Oct 29, 2010 at 8:20 AM, Brian O'Kennedy <[email protected]> wrote:

> Hi Michael,
>
> Thanks for the suggestion. To simplify your example slightly, let's say that
> I'm only interested in a single exchange.
>
> I'd like to be able to (quickly) extract all data (multiple symbols for a
> single timestamp), but I'd also like to (preferably, fairly quickly) extract
> all values over all time for a particular symbol (single symbol, multiple
> timestamps).
>
> So, from your description below I believe I can come up with a design that
> does ONE of these two queries very well, but the other very badly. Is there
> a way to have the best of both without having to implement both separately?
>
> And if I do so, do I lose all ability to update this database in an atomic
> fashion? (i.e., insert a bunch of new data for some timestamp)
>
HBase provides atomicity only within a single row; there are no cross-row
transactions. The recent TableIndexed contrib may help, but I have not
personally tried it.
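One caveat on the atomicity point: HBase does apply all mutations to a single row atomically, so if every symbol for a given timestamp lives in one row, that insert commits as a unit; it is only cross-row updates that lack transactions. A minimal sketch of packing one timestamp's quotes into a single-row mutation (the key layout and the `ex` column family are illustrative, not from this thread):

```python
def make_row_mutation(date: str, millis: int, quotes: dict) -> tuple:
    """Pack all symbols for one timestamp into a single row.

    HBase applies every cell written to one row key atomically,
    so this whole batch of quotes commits (or fails) together.
    The 'ex' column family and key layout are assumptions.
    """
    # Zero-pad milliseconds so row keys sort lexicographically by time.
    row_key = f"{date}_{millis:09d}".encode()
    cells = {f"ex:{symbol}".encode(): str(price).encode()
             for symbol, price in quotes.items()}
    return row_key, cells
```

Over the Thrift bindings mentioned earlier in the thread, a row built this way would map onto a single mutateRow call.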


>
> Thanks,
>  Brian
>
>
> On 29 October 2010 15:53, Michael Segel <[email protected]> wrote:
>
> >
> > Brian,
> >
> > I think you have to consider how you're going to use the data when you
> > consider your schema.
> >
> > An example...
> > If we're looking at stock market data, you could use the timestamp as your
> > row key, with a column family for each exchange and a column per stock
> > symbol where you store the ask/bid and trade (volume@price) as your value.
> >
> > This is one potential time-series schema. However, suppose you want to
> > track all of the IBM trades across all exchanges?
> > You're going to have a harder time getting the data that you want.
> >
> > You may then want to prefix the key with the stock symbol and, instead of
> > a column family per exchange, have a column per exchange.
> > This tends to co-locate like data, so you can do range scans.
> >
> > Now, either time-series schema is valid, but the second is the better fit
> > if you are looking at data on a per-stock basis.
> >
> > Does that make sense?
> >
> >
> > > From: [email protected]
> > > Date: Fri, 29 Oct 2010 10:10:23 +0100
> > > Subject: Time-series schema
> > > To: [email protected]
> > >
> > > Hi,
> > >
> > > I apologise if this has been asked a million times, but after some
> > > searching I'm still not sure if this is a good idea. I've got my local
> > > (currently standalone) server running, Thrift bindings etc., and have
> > > started playing with schemas.
> > >
> > > I'd like to store a large amount of numeric time-series data using
> > > HBase. The data can be visualised as a 2d array.
> > >
> > > The row axis is the timestamp (YYYYMMDD_Milliseconds), with between 1
> > > and 100 million rows per day.
> > > The column axis is a numeric identifier (about 20,000 unique ids).
> > > Each cell of this array is a small number of values representing some
> > > information for this identifier at this timestamp.
> > >
> > > The array is very sparse: some identifiers will only have one entry per
> > > day, some will have millions. I thought HBase might be a good fit due to
> > > the scaling (I've got many terabytes of data to store) and the built-in
> > > versioning of cells. Occasionally I need to overwrite previous cell
> > > values, but always keep a complete history of previous values to produce
> > > 'point-in-time' views of the dataset.
> > >
> > > My first HBase schema was along the lines of having a row per
> > > timestamp: YYYYMMDD_Milliseconds, containing a column family for the
> > > identifiers, with values stored in there.
> > >
> > > This gives me nice and fast lookup by timestamp, but does not work at
> > > all for looking up all values for a specific identifier over all times.
> > > Going back to the 2d array description, I need to be able to slice along
> > > rows (timestamps) or columns (identifiers).
> > >
> > > Any tips as to how to achieve something like this using HBase? Am I
> > > using the wrong tool for the job? Am I completely misunderstanding how
> > > this all works?
> > >
> > > Thanks,
> > >   Brian
> >
> >
>
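For reference, the two row-key designs discussed in this thread can be sketched as follows; all function names and key layouts are illustrative, not prescribed by the thread:

```python
def ts_row_key(date: str, millis: int) -> bytes:
    # Design 1: one row per timestamp -- a single Get returns every
    # symbol at that instant, but per-symbol history needs a full scan.
    return f"{date}_{millis:09d}".encode()

def symbol_row_key(symbol: str, date: str, millis: int) -> bytes:
    # Design 2: symbol-prefixed key -- one symbol's history is
    # contiguous on disk, so it can be read with a range scan.
    return f"{symbol}_{date}_{millis:09d}".encode()

def symbol_scan_range(symbol: str) -> tuple:
    # Start/stop row keys covering every row for one symbol under
    # Design 2. The stop key bumps the separator byte ('_' -> '`')
    # so the scan excludes the next symbol's rows.
    start = f"{symbol}_".encode()
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop
```

Keeping both access paths fast generally means maintaining both layouts (i.e., writing each cell twice), which is where the atomicity question above comes in.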
