The strategy doesn't require putting all the recent data on a single node.

What has been suggested is collection-based - the most recent data will simply 
be in its own collection, which may or may not be on a single node.

This is pretty much always going to be advantageous for time series data. You 
can manage things as Otis said - even having a collection for the past hour if 
that's what you need to do. That last hour can be distributed over as many 
nodes as you need.
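
As a rough illustration of the per-period collection idea, here is a minimal 
sketch in Python. It assumes a hypothetical naming scheme of one collection 
per hour ("news_YYYYMMDD_HH"); the prefix, the function names, and the hourly 
granularity are all assumptions made for the example, not anything Solr 
provides. It only computes names: which collection a document belongs in, 
and which collections to search, newest first.

```python
from datetime import datetime, timedelta, timezone


def collection_for(ts: datetime, prefix: str = "news") -> str:
    """Name of the hourly collection a timestamp falls into.

    The "<prefix>_YYYYMMDD_HH" scheme is hypothetical. Pass a
    timezone-aware datetime; it is normalized to UTC first.
    """
    return prefix + "_" + ts.astimezone(timezone.utc).strftime("%Y%m%d_%H")


def collections_newest_first(now: datetime, hours: int, prefix: str = "news") -> list:
    """Collection names covering the last `hours` hours, newest first.

    A caller could query the first name and only move on to older
    ones if it did not find enough results.
    """
    # Truncate to the start of the current hour so each step is one bucket.
    start = now.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return [collection_for(start - timedelta(hours=i), prefix) for i in range(hours)]
```

Indexing would send each document to collection_for(doc_time), and a search 
would walk collections_newest_first(now, n) until it has enough hits.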

Older collections will be stable and can be heavily cached, perhaps serve 
queries more slowly, and can be easily merged together or removed from the 
system.
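
To make the "removed from the system" part concrete, here is a small sketch 
of a retention plan, assuming hypothetical daily collections named 
"news_YYYYMMDD" and a 30-day window. The names, the window, and the 
plan_retention helper are illustrative assumptions. The sketch only decides 
which collections fall outside the window; actually deleting or merging them 
would go through whatever admin tooling you use and is not shown.

```python
from datetime import datetime, timedelta


def plan_retention(collections, now, keep_days=30, prefix="news_"):
    """Split daily collection names into (keep, drop) lists.

    Assumes every name is `prefix` followed by YYYYMMDD (a hypothetical
    naming scheme). Collections older than `keep_days` go in `drop`.
    """
    cutoff = (now - timedelta(days=keep_days)).date()
    keep, drop = [], []
    for name in sorted(collections):
        # Parse the date suffix back out of the collection name.
        day = datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        (keep if day >= cutoff else drop).append(name)
    return keep, drop
```

Because each old collection is a self-contained unit, dropping a day of data 
is a single collection-level operation rather than a delete-by-query against 
a live index.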

Nothing about it is automatically good, but this is the type of architecture 
you pretty much always want with time series data if you really want to scale. 
There are lots of little twists around it and different tradeoffs you can 
make, but by and large, the main concepts are the same.

There are many things to consider when building this architecture - for 
example, you may want to carefully control collection placement rather than 
accepting the standard random distribution across nodes.

Eventually it should become fairly simple to implement this pattern with a 
single collection as the custom sharding features advance.

- Mark

On Jul 6, 2013, at 7:23 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Not saying it's always one way or the other, just
> that one shouldn't automatically _assume_
> putting the most recent data on a single node
> is automatically good. It may well be, but
> not in all cases.
> 
> 
> 
> 
> On Wed, Jul 3, 2013 at 12:21 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
> 
>> Exactly.  And the newest shard can also be kept small (e.g. maybe just
>> last 12h is OK to hit first and dig deeper only if you can't find
>> enough stories in the last 12h), which means it will fit in memory and
>> be crazy fast.
>> 
>> Otis
>> --
>> Solr & ElasticSearch Support -- http://sematext.com/
>> Performance Monitoring -- http://sematext.com/spm
>> 
>> 
>> 
>> On Wed, Jul 3, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com>
>> wrote:
>>> 
>>> On Jul 3, 2013, at 7:47 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>>> Usually most people
>>>> care about today's news, and a hot story will
>>>> generate lots of queries, all of which are serviced
>>>> by today's shard.
>>> 
>>> That's really the whole point though - rather than slamming your whole
>>> cluster with every search, the majority of people are just searching
>>> today - which will have only a fraction of the data and will be able to
>>> hold up very well to a large load. This is also how you can do really
>>> fast NRT on a huge data set - it only has to happen on today's shard.
>>> 
>>> News sites have been using the trick forever with Lucene.
>>> 
>>> - Mark
>> 
