The strategy doesn't require putting all the recent data on a single node. What has been suggested is collection-based - the most recent data will simply be in its own collection, which may or may not be on a single node.
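For what it's worth, here is a rough Python sketch of the collection-per-time-window idea, including the "hit the newest data first and dig deeper only if needed" approach mentioned further down the thread. The hourly granularity, the "news" prefix, and all function names are made up for illustration, and query_one stands in for whatever client call you use to search a single collection. None of this is a Solr API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical naming scheme: one collection per UTC hour, e.g. "news_2013070612".
PREFIX = "news"


def collection_for(ts: datetime, prefix: str = PREFIX) -> str:
    """Name of the hourly collection that a document timestamped `ts` belongs to."""
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y%m%d%H}"


def collections_between(start: datetime, end: datetime, prefix: str = PREFIX) -> list:
    """Names of the hourly collections covering [start, end], newest first."""
    utc_end = end.astimezone(timezone.utc)
    utc_start = start.astimezone(timezone.utc)
    cur = utc_end.replace(minute=0, second=0, microsecond=0)
    first = utc_start.replace(minute=0, second=0, microsecond=0)
    names = []
    while cur >= first:
        names.append(collection_for(cur, prefix))
        cur -= timedelta(hours=1)
    return names


def search_newest_first(collections, query_one, wanted):
    """Query collections in the given order and stop once `wanted` hits are found.

    `query_one` is a caller-supplied function (an assumption of this sketch)
    that takes a collection name and returns a list of hits.
    """
    hits = []
    for name in collections:
        hits.extend(query_one(name))
        if len(hits) >= wanted:
            break  # older collections are never touched
    return hits[:wanted]
```

Indexing only ever writes to collection_for(now), so the older collections stay stable, and most searches stop after the first one or two names in the list.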
This collection-based layout is pretty much always going to be advantageous for time series data. You can manage things as Otis said - even having a collection for the past hour if that's what you need to do. That last hour can be distributed over as many nodes as you need. Older collections will be stable and can be heavily cached, can perhaps serve queries more slowly, and can be easily merged together or removed from the system.

Nothing about it is automatically good, but this is the type of architecture you pretty much always want with time series data if you really want to scale. There are lots of little twists around it and different tradeoffs you can make, but by and large, the main concepts are the same.

There are many things to consider when building this architecture - for example, you may want to carefully control collection placement rather than accepting the standard random distribution across nodes.

Eventually it should become fairly simple to implement this pattern with a single collection as the custom sharding features advance.

- Mark

On Jul 6, 2013, at 7:23 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Not saying it's always one way or the other, just
> that one shouldn't automatically _assume_
> putting the most recent data on a single node
> is automatically good. It may well be, but
> not in all cases.
>
> On Wed, Jul 3, 2013 at 12:21 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Exactly. And the newest shard can also be kept small (e.g. maybe just
>> last 12h is OK to hit first and dig deeper only if you can't find
>> enough stories in the last 12h), which means it will fit in memory and
>> be crazy fast.
>>
>> Otis
>> --
>> Solr & ElasticSearch Support -- http://sematext.com/
>> Performance Monitoring -- http://sematext.com/spm
>>
>> On Wed, Jul 3, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com>
>> wrote:
>>>
>>> On Jul 3, 2013, at 7:47 AM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Usually most people
>>>> care about today's news, and a hot story will
>>>> generate lots of queries, all of which are serviced
>>>> by today's shard.
>>>
>>> That's really the whole point though - rather than slamming your whole
>>> cluster with every search, the majority of people are just searching today
>>> - which will have only a fraction of the data and will be able to hold up
>>> very well to a large load. This is also how you can do really fast NRT on a
>>> huge data set - it only has to happen on today's shard.
>>>
>>> News sites have been using the trick forever with Lucene.
>>>
>>> - Mark
>>