Take a look at openTSDB.

You might want to use that as is, or steal some of the concepts.  The major
idea to steal is using a single row of the database (a document in Lucene
or Solr) to hold many data points.

Thus, you could consider having documents with the following fields:

key: entity+time-period
kv: repeated key-value pairs parsed as keywords
data: a protobuf or avro encoded blob, stored but not indexed.  This would
include time offsets and values.  You might also have a file reference here.
A sketch of building such a document follows this list.
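
As a concrete illustration, here is a minimal SolrJ sketch of building one
such document.  The field names, the entity|period key format, and the
fixed 12-bytes-per-point layout are my own assumptions (not anything
openTSDB or Solr prescribes), and it presumes a schema where key and kv
are indexed and data is a stored-only binary field:

    import java.nio.ByteBuffer;
    import org.apache.solr.common.SolrInputDocument;

    public class PriceBlobWriter {

        // Pack (dayOffset, price) pairs into a compact binary blob:
        // a 4-byte int offset followed by an 8-byte double per point.
        static byte[] encode(int[] dayOffsets, double[] prices) {
            ByteBuffer buf = ByteBuffer.allocate(dayOffsets.length * (4 + 8));
            for (int i = 0; i < dayOffsets.length; i++) {
                buf.putInt(dayOffsets[i]);   // days since the start of the period
                buf.putDouble(prices[i]);    // value on that day
            }
            return buf.array();
        }

        // One document per entity per time period, keyed like "IBM|2011".
        static SolrInputDocument buildDoc(String entity, String period,
                                          int[] dayOffsets, double[] prices) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("key", entity + "|" + period);       // unique key
            doc.addField("kv", "exchange=NYSE");              // hypothetical tag pair
            doc.addField("data", encode(dayOffsets, prices)); // stored, not indexed
            return doc;
        }
    }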

If you were to follow the openTSDB example, the entity would be a metric
and the kv pairs would hold things like hostnames, which would allow you
to narrow the retrieved data.  Under openTSDB is HBase, which does some
cleverness to get contiguous I/O for consecutive records from the same
entity.  That really makes things fast.  If you make the data blobs in Solr
big enough, you might get a similar win.  Or not.  Try it.
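
On the retrieval side you would fetch the stored blob and unpack it on the
way to the chart.  A minimal sketch, assuming the same 12-bytes-per-point
layout as the encoder above:

    import java.nio.ByteBuffer;

    public class PriceBlobReader {

        // Inverse of the encoder: walk the blob and hand each point to the caller.
        static void decode(byte[] blob) {
            ByteBuffer buf = ByteBuffer.wrap(blob);
            while (buf.remaining() >= 4 + 8) {   // one point = int offset + double value
                int dayOffset = buf.getInt();
                double price = buf.getDouble();
                System.out.printf("day %d -> %.2f%n", dayOffset, price);
            }
        }
    }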

For instance, if you store an entire year of values in a blob, then you
will have 20 x 500,000 = 10M documents and each blob will be a few
kilobytes.  This means you could probably hold everything in memory with a
few tens of GB of memory.  With less memory, most graphs would require at
most a half dozen or so blobs which could be pulled in a single query.  I
would guess that most queries would be very fast even if backed by disk.
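
Spelling out the arithmetic behind that estimate (the 12-byte point size
comes from the sketch above and is an assumption; real documents would add
some index and storage overhead on top of the raw blobs):

    // Back-of-the-envelope sizing for one blob per entity per year.
    public class Sizing {
        public static void main(String[] args) {
            long entities = 500000L;
            int years = 20;
            int pointsPerYear = 6000 / years;              // ~300 daily values per year
            int bytesPerPoint = 4 + 8;                     // int offset + double value
            long docs = entities * years;                  // 10,000,000 documents
            int blobBytes = pointsPerYear * bytesPerPoint; // ~3,600 bytes per blob
            long totalBytes = docs * blobBytes;            // a few tens of GB of raw blobs
            System.out.printf("%d docs, %d bytes/blob, ~%d GB total%n",
                    docs, blobBytes, totalBytes / 1000000000L);
        }
    }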

On Thu, Jan 19, 2012 at 9:09 PM, Robert Stewart <bstewart...@gmail.com> wrote:

> I have a project where the client wants to store time series data
> (maybe in SOLR if it can work).  We want to store daily "prices" over
> the last 20 years (about 6000 values with associated dates), for up to
> 500,000 entities.
>
> This data currently exists in a SQL database.  Access to SQL is too
> slow for the client's needs at this point.  The requirements are to
> fetch up to 6000 daily prices for an entity and render a chart in
> real-time on a web page.
>
> One way we can do it is to generate one document for every daily
> price, per entity, so we have 500,000 * 6000 = 3 billion docs in SOLR.
> We created a simple proof of concept with 10 million documents and it
> works perfectly.  But I assume up to 3 billion small documents is too
> much for a single index.  What is the hard limit on the total # of
> documents you can put into a SOLR index (regardless of memory, disk
> space, etc.)?  The good thing about this approach is it works fine
> using the existing data import handler for SQL.  I know we can shard
> the index per entity using some hash, but I want to know what the
> upper limit per index is.
>
> Another way is to store each set of 6000 prices as some blob (maybe
> JSON) as a single field on a document, and have one document per entity
> (500,000 documents).  That will work, but there is no way to do this
> using the existing data import handlers, correct?  If possible, I don't
> want to develop a custom import handler or data loader unless I
> absolutely have to.  Is there some template function or something
> available in the current DIH features to make this work?
>
> Thanks
> Bob
>