On Fri, Oct 29, 2010 at 8:20 AM, Brian O'Kennedy <[email protected]> wrote:
> Hi Michael,
>
> Thanks for the suggestion. To simplify your example slightly, let's say that
> I'm only interested in a single exchange.
>
> I'd like to be able to (quickly) extract all data (multiple symbols for a
> single timestamp), but I'd also like to (preferably, fairly quickly) extract
> all values over all time for a particular symbol (single symbol, multiple
> timestamps).
>
> So, from your description below I believe I can come up with a design that
> does ONE of these two queries very well, but the other very badly. Is there
> a way to have the best of both without having to implement both separately?
>
> And if I do so, do I lose all ability to update this database in an atomic
> fashion? (i.e., insert a bunch of new data for some timestamp)
>

There is no such atomicity provided by HBase. Recent TableIndexed may help,
but I have not personally tried it.

>
> Thanks,
> Brian
>
>
> On 29 October 2010 15:53, Michael Segel <[email protected]> wrote:
>
> >
> > Brian,
> >
> > I think you have to consider how you're going to use the data when you
> > consider your schema.
> >
> > An example...
> > If we're looking at stock market data you could use the timestamp as your
> > key, and then a column family for each exchange, and then a column for a
> > stock where you store the ask/bid, trade(volume)@price as your value.
> >
> > This is one potential time series schema. However, suppose you want to
> > track all of the IBM trades across all exchanges?
> > You're going to have a harder time of getting the data that you want.
> >
> > You may then want to prefix the key with the stock symbol and instead of a
> > column family per exchange, you have a column per exchange.
> > This would tend to co-locate like data. So you can do some range scans.
> >
> > Now either time series schema is valid. But one is going to be more valid
> > if you are looking at data on a per stock basis.
> >
> > Does that make sense?
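
To make the symbol-prefixed key idea concrete, here is a minimal sketch in
plain Python. It does not use any HBase client API: a sorted list stands in
for HBase's lexicographically ordered row keys. The "|" separator, the
8-digit zero padding of the milliseconds, and the names row_key and
prefix_scan are my own assumptions for illustration, not anything from the
thread above.

```python
import bisect


def row_key(symbol: str, day: str, millis: int) -> bytes:
    """Build a key of the form SYMBOL|YYYYMMDD_mmmmmmmm (hypothetical layout).

    Zero-padding the milliseconds keeps byte order equal to time order;
    without it, "40" would sort before "5".
    """
    return f"{symbol}|{day}_{millis:08d}".encode("ascii")


def prefix_scan(sorted_keys, prefix: bytes):
    """Return every key that starts with prefix, in stored (sorted) order.

    The stop key is the prefix with its last byte incremented, which is
    the usual way to turn a prefix into a [start, stop) range.
    """
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Stand-in for a table: row keys kept in sorted order.
keys = sorted([
    row_key("IBM", "20101029", 40),
    row_key("MSFT", "20101029", 7),
    row_key("IBM", "20101029", 5),
    row_key("IBMA", "20101029", 1),
])

# All rows for one symbol come back as one contiguous, time-ordered range.
ibm_rows = prefix_scan(keys, b"IBM|")
```

With this layout the per-symbol query is a single range scan, but fetching
every symbol for one timestamp means touching one range per symbol, which is
the trade-off described above.
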
> >
> > > From: [email protected]
> > > Date: Fri, 29 Oct 2010 10:10:23 +0100
> > > Subject: Time-series schema
> > > To: [email protected]
> > >
> > > Hi,
> > >
> > > I apologise if this has been asked a million times, but after some
> > > searching I'm still not sure if this is a good idea. I've got my local
> > > (currently standalone) server running, Thrift bindings etc and have
> > > started playing with schemas.
> > >
> > > I'd like to store a large amount of numeric time-series data using
> > > HBase. The data can be visualised as a 2d array.
> > >
> > > Row-axis is timestamp (YYYYMMDD_Milliseconds) (between 1 and 100 million
> > > rows per day)
> > > Column axis is a numeric identifier (in the range of about 20 000 unique
> > > ids)
> > > Each cell of this array is a small number of values representing some
> > > information for this identifier at this timestamp.
> > >
> > > The array is very sparse, some identifiers will only have one entry per
> > > day, some will have millions. I thought HBase might be a good fit due to
> > > the scaling (I've got many terabytes of data to store) and the built-in
> > > versioning of cells. Occasionally I need to overwrite previous cell
> > > values, but always keep a complete history of previous values to produce
> > > 'point-in-time' views of the dataset.
> > >
> > > My first HBase schema was along the lines of having a row per timestamp:
> > > YYYYMMDD_Milliseconds containing a column family for the identifiers,
> > > with values stored in there.
> > >
> > > This gives me nice and fast lookup by timestamp, but does not work at all
> > > for looking up all values for a specific identifier over all times.
> > > Going back to the 2d array description, I need to be able to slice along
> > > rows (timestamps) or columns (identifiers).
> > >
> > > Any tips as to how to achieve something like this using HBase? Am I using
> > > the wrong tool for the job? Am I completely misunderstanding how this all
> > > works?
> > >
> > > Thanks,
> > > Brian
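
One way to slice along both axes is to write each cell twice, once per
layout. The following is only a rough in-memory sketch of that idea, not an
HBase implementation: the class DualLayoutStore, its method names, the
5-digit identifier padding and the "|" separator are all hypothetical. Note
that the two writes in put() are independent, so nothing here makes them
atomic as a pair.

```python
from collections import defaultdict


class DualLayoutStore:
    """In-memory stand-in for two tables holding the same cells."""

    def __init__(self):
        # Layout 1: one row per timestamp, one column per identifier.
        self.by_time = defaultdict(dict)
        # Layout 2: one row per (identifier, timestamp) composite key.
        self.by_id = {}

    def put(self, ts_key: str, ident: int, value):
        # Two separate writes; a failure between them would leave the
        # layouts out of sync, so a real system needs a repair strategy.
        self.by_time[ts_key][ident] = value
        self.by_id[f"{ident:05d}|{ts_key}"] = value

    def slice_timestamp(self, ts_key: str) -> dict:
        """All identifiers for a single timestamp (one row in layout 1)."""
        return dict(self.by_time.get(ts_key, {}))

    def slice_identifier(self, ident: int) -> list:
        """All timestamps for one identifier, in time order (layout 2)."""
        prefix = f"{ident:05d}|"
        rows = [(k[len(prefix):], v) for k, v in self.by_id.items()
                if k.startswith(prefix)]
        return sorted(rows)


store = DualLayoutStore()
store.put("20101029_00000040", 42, 1.6)
store.put("20101029_00000005", 42, 1.5)
store.put("20101029_00000005", 7, 2.0)

at_time = store.slice_timestamp("20101029_00000005")
for_id = store.slice_identifier(42)
```

The cost is double the storage and write volume, plus the need to keep the
two layouts consistent in application code.
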
