RE: Time-series schema

Michael Segel Fri, 29 Oct 2010 07:54:00 -0700

Brian,

I think you have to consider how you're going to use the data when you design 
your schema.


An example... 
If we're looking at stock market data, you could use the timestamp as your key, 
with a column family for each exchange and a column for each stock, where 
you store the ask/bid and trade (volume@price) as your value.

This is one potential time series schema. However, suppose you want to track 
all of the IBM trades across all exchanges?  
You're going to have a harder time getting the data that you want, because 
those trades are spread across every timestamp row.
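
To make the problem concrete, here is a rough sketch in plain Python. It is 
a toy in-memory model of the timestamp-keyed layout, not HBase client code, 
and the symbols, exchanges and values are invented for illustration.

```python
# Toy model of the timestamp-keyed layout (not an HBase client).
# row key (timestamp) -> column family (exchange) -> column (symbol) -> value
# All sample values below are made up.
rows = {
    "20101029_000001": {"NYSE": {"IBM": "100@143.10", "GE": "500@16.02"}},
    "20101029_000002": {"NASDAQ": {"MSFT": "200@26.60"}},
    "20101029_000003": {"NASDAQ": {"IBM": "300@143.12"}},
}


def trades_for_symbol(rows, symbol):
    """Collect every trade for one symbol, counting how many rows were read."""
    hits = []
    rows_examined = 0
    for ts in sorted(rows):
        rows_examined += 1  # every timestamp row has to be visited
        for exchange, columns in rows[ts].items():
            if symbol in columns:
                hits.append((ts, exchange, columns[symbol]))
    return hits, rows_examined


hits, rows_examined = trades_for_symbol(rows, "IBM")
print(hits)
print("rows examined:", rows_examined)
```

Nothing in the key narrows the search to one symbol, so the lookup reads 
every row even though only some of them contain an IBM trade.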

You may then want to prefix the key with the stock symbol (symbol first, then 
timestamp) and, instead of a column family per exchange, have a column per 
exchange.
This would tend to co-locate like data, so you can do range scans over a 
single stock. 
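
A rough sketch of the symbol-prefixed key, again in plain Python rather than 
HBase client code. It relies on rows being kept sorted by key, which is what 
makes a range scan possible. The "|" separator, the helper names and the 
sample data (including the IBMX symbol) are my own assumptions for 
illustration.

```python
from bisect import bisect_left

SEP = "|"  # hypothetical separator between symbol and timestamp


def make_key(symbol, ts):
    """Build a composite row key: symbol first, then timestamp."""
    return f"{symbol}{SEP}{ts}"


# Rows kept sorted by key; each row has one column per exchange.
# All sample values are made up.
table = sorted(
    [
        (make_key("IBM", "20101029_000003"), {"NASDAQ": "300@143.12"}),
        (make_key("MSFT", "20101029_000002"), {"NASDAQ": "200@26.60"}),
        (make_key("IBM", "20101029_000001"), {"NYSE": "100@143.10"}),
        (make_key("IBMX", "20101029_000001"), {"NYSE": "10@5.00"}),
        (make_key("GE", "20101029_000001"), {"NYSE": "500@16.02"}),
    ],
    key=lambda kv: kv[0],
)


def scan_symbol(table, symbol):
    """Return the contiguous slice of rows whose key starts with symbol + SEP."""
    keys = [k for k, _ in table]
    start = symbol + SEP
    stop = symbol + chr(ord(SEP) + 1)  # smallest string past the prefix range
    return table[bisect_left(keys, start):bisect_left(keys, stop)]


ibm_rows = scan_symbol(table, "IBM")
for key, columns in ibm_rows:
    print(key, columns)
```

Because the separator is part of the scanned prefix, a symbol such as IBMX 
does not leak into the IBM range, and the rows for one stock come back in 
timestamp order.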

Now, either time series schema is valid. But the second is going to be a better 
fit if you are looking at data on a per-stock basis.

Does that make sense?


> From: [email protected]
> Date: Fri, 29 Oct 2010 10:10:23 +0100
> Subject: Time-series schema
> To: [email protected]
> 
> Hi,
> 
> I apologise if this has been asked a million times, but after some searching
> I'm still not sure if this is a good idea. I've got my local (currently
> standalone) server running, Thrift bindings etc and have started playing
> with schemas.
> 
> I'd like to store a large amount of numeric time-series data using
> HBase. The data can be visualised as a 2d array.
> 
> Row-axis is timestamp (YYYYMMDD_Milliseconds) (between 1 and 100 million
> rows per day)
> Column axis is a numeric identifier (in the range of about 20 000 unique
> ids)
> Each cell of this array is a small number of values representing some
> information for this identifier at this timestamp.
> 
> The array is very sparse, some identifiers will only have one entry per day,
> some will have millions. I thought HBase might be a good fit due to the
> scaling (I've got many terabytes of data to store) and the built-in
> versioning of cells. Occasionally I need to overwrite previous cell values,
> but always keep a complete history of previous values to produce
> 'point-in-time' views of the dataset.
> 
> My first HBase schema was along the lines of having a row per timestamp:
>  YYYYMMDD_Milliseconds containing a column family for the identifiers, with
> values stored in there.
> 
> This gives me nice and fast lookup by timestamp, but does not work at all
> for looking up all values for a specific identifier over all times. Going
> back to the 2d array description, I need to be able to slice along rows
> (timestamps) or columns (identifiers).
> 
> Any tips as to how to achieve something like this using HBase? Am I using the
> wrong tool for the job? Am I completely misunderstanding how this all
> works?
> 
> Thanks,
>   Brian
