That's a laudable goal - supporting low-latency queries, including faceting, for "hundreds of millions" of documents, using Solr "out of the box" on a random, commodity box selected by IT, just adding a dozen or two fields to the default schema that are both indexed and stored, without any "expert" tuning, by an "average" developer. The reality doesn't seem to be there today. 50 to 100 million documents, yes, but beyond that takes some kind of "heroic" effort: a much beefier box, very careful and limited data modeling, limiting of query capabilities, tolerance of higher latency, expert tuning, etc.
The proof is always in the pudding - pick a box, install Solr, set up the schema, load 20 or 50 or 100 or 250 or 350 million documents, try some queries with the features you need, and you get what you get. But I agree that it would be highly desirable to push that 100 million number up to 350 million or even 500 million ASAP, since the pain of unnecessary sharding is excessive. I wonder what changes will have to occur in Lucene, or what evolution in commodity hardware will be necessary, to get there.

-- Jack Krupansky

On Sat, Jan 3, 2015 at 6:11 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Erick Erickson [erickerick...@gmail.com] wrote:
> > I can't disagree. You bring up some of the points that make me _extremely_
> > reluctant to try to get this in to 5.x though. 6.0 at the earliest I should
> > think.
>
> Ignoring the magic 2b number for a moment, I think the overall question is
> whether or not single shards should perform well in the hundreds of
> millions of documents range. The alternative is more shards, but it is
> quite an explicit process to handle shard-juggling. From an end-user
> perspective, the underlying technology matters little: whatever the choice,
> it should be possible to install "something" on a machine and expect it to
> scale within the hardware limitations without much ado.
>
> - Toke Eskildsen
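For what it's worth, the "pick a box, install Solr, load N million documents" experiment can be sketched as a small bulk-load script against Solr's JSON update handler. This is only an illustration of the idea: the collection name "bench", the field names, and the localhost URL are assumptions for the sketch, not anything from this thread.

```python
# Sketch of a single-node bulk-load benchmark for Solr's JSON update
# handler. Collection name ("bench") and field names are hypothetical;
# adjust to your schema. Assumes a dozen-odd indexed+stored fields, per
# the "out of the box plus a few fields" scenario discussed above.
import json
import random
import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/bench/update?commit=false"

def make_doc(i):
    """Build one synthetic document (all field names are made up)."""
    return {
        "id": str(i),
        "title_t": f"document {i}",
        "category_s": random.choice(["news", "blog", "forum"]),
        "price_f": round(random.uniform(1, 1000), 2),
        "created_dt": "2015-01-03T00:00:00Z",
    }

def batches(total, size=10000):
    """Yield lists of synthetic documents, `size` docs per batch."""
    for start in range(0, total, size):
        yield [make_doc(i) for i in range(start, min(start + size, total))]

def post_batch(batch):
    """POST one batch to Solr's JSON update handler (needs a live node)."""
    req = urllib.request.Request(
        SOLR_UPDATE,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

def load(total):
    """Drive the load; scale `total` from 20M toward 350M and watch
    latency on facet queries as the index grows."""
    for batch in batches(total):
        post_batch(batch)

# load(100_000_000)  # uncomment against a running node
```

After loading, the facet-latency half of the experiment is just timed queries, e.g. `curl 'http://localhost:8983/solr/bench/select?q=*:*&rows=0&facet=true&facet.field=category_s'`, repeated at each document count.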