Over the past few years, I have seen requests for more than 2B documents in a single index only on the Solr user mailing list - nobody has requested this on the Lucene user list. The point is that this is primarily a Lucene issue that just happens to surface for Solr users, but until there is a hue and cry from the non-Solr portion of the Lucene community, we simply have to wait for the Lucene committers to take action.
Another possibility would be for Solr to support a second level of sharding: automatically create a new shard when the number of documents in a shard exceeds some threshold. Solr does already support collections larger than 2B docs - they just need to be sharded. That threshold would have to be below the Lucene hard limit, of course, but it is probably appropriate to set it much lower - say 500 million, or even my usual 100 million guideline. The theory is that each shard could be searched in parallel in a separate thread, so that Solr could actually search a multi-sharded index faster than a single monolithic index - provided that you have enough CPU cores and RAM.

Hmmm... I wonder what the Elasticsearch guys might be up to on this front?

-- Jack Krupansky

On Wed, Feb 11, 2015 at 8:05 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Are there any such structures?
>
> Well, I thought there were, but I've got to admit I can't call any to mind
> immediately.
>
> bq: 2b is just the hard limit
>
> Yeah, I'm always a little nervous as to when Moore's Law will make
> everything I know about current systems' performance obsolete.
>
> At any rate, I _can_ say with certainty that I have no interest at this
> point in exceeding this limit. Of course that may change with
> compelling use-cases ;)....
>
> Best,
> Erick
>
> On Wed, Feb 11, 2015 at 4:14 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> > Erick Erickson [erickerick...@gmail.com] wrote:
> >
> >> I guess my $0.02 is that you'd have to have strong evidence that extending
> >> Lucene to 64 bit is even useful. Or more generally, useful enough to pay the
> >> penalty. All the structures that allocate maxDoc id arrays would suddenly
> >> require twice the memory for instance,
> >
> > Are there any such structures? It was my impression that ID-structures in
> > Solr were either bitmaps, hashmaps or queues.
> > Anyway, if the number of places with full-size ID-arrays is low, there
> > could be dual implementations selected by maxDoc.
> >
> >> plus all the coding effort that could be spent doing other things.
> >
> > Very true. I agree that at the current stage, >2b/shard is still a bit too
> > special to spend a lot of effort on it.
> >
> > However, 2b is just the hard limit. As has been discussed before, single
> > shards work best in the lower end of the hundreds of millions of documents.
> > One reason is that many parts of Lucene work single-threaded on structures
> > that scale linearly with document count. Having some hundreds of millions
> > of documents (log analysis being the typical case) is not uncommon these
> > days. A gradual shift to more multi-thread oriented processing would fit
> > well with current trends in hardware as well as use cases. As opposed to
> > the int->long switch, there would be little to no penalty for setups with
> > low maxDocs (they would just use 1 thread).
> >
> > - Toke Eskildsen
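To make the memory penalty Erick mentions concrete, here is a back-of-envelope sketch - plain Java arithmetic, not actual Lucene code, and every class and method name here is invented for illustration. It shows what a maxDoc-sized array costs at Lucene's current 2^31-1 ceiling with 4-byte int entries versus 8-byte long entries, and how many shards a per-shard threshold like the 100M guideline implies for a large collection.

```java
// Hypothetical back-of-envelope sketch (not Lucene code):
// why the int -> long docID switch doubles memory for maxDoc-sized arrays.
public class DocIdMemorySketch {

    // Bytes needed for one ID-per-document array.
    static long bytesForIds(long maxDoc, int bytesPerId) {
        return maxDoc * bytesPerId;
    }

    // Ceiling division: shards needed to stay under a per-shard doc limit.
    static long shardsNeeded(long totalDocs, long perShardLimit) {
        return (totalDocs + perShardLimit - 1) / perShardLimit;
    }

    public static void main(String[] args) {
        long maxDoc = Integer.MAX_VALUE; // Lucene's hard limit: ~2.147B docs

        long intBytes = bytesForIds(maxDoc, Integer.BYTES);  // 4 bytes/doc
        long longBytes = bytesForIds(maxDoc, Long.BYTES);    // 8 bytes/doc
        System.out.printf("int[maxDoc]:  %.1f GB%n", intBytes / 1e9);  // about 8.6 GB
        System.out.printf("long[maxDoc]: %.1f GB%n", longBytes / 1e9); // about 17.2 GB

        // Sharding a 5B-doc collection at the 100M-docs-per-shard guideline:
        System.out.println("shards: " + shardsNeeded(5_000_000_000L, 100_000_000L));
    }
}
```

The exact twofold increase is the point: any structure sized to maxDoc pays double under long IDs, whether or not the index ever approaches the limit.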
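The "one thread per shard" idea discussed above can be sketched as follows. This is a hypothetical illustration, not Solr's actual distributed-search code - `Shard`, `Hit`, and `searchAll` are invented names: fan the query out to each shard on its own thread, then merge the per-shard top-k lists into a global top-k by score.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of per-shard parallel search (invented names, not Solr APIs).
public class ParallelShardSearch {

    record Hit(long globalDocId, float score) {}

    // Each shard answers a query with its own top-k hits.
    interface Shard {
        List<Hit> search(String query, int topK);
    }

    // Search every shard on its own thread, then merge to a global top-k.
    static List<Hit> searchAll(List<Shard> shards, String query, int topK)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (Shard s : shards) {
                futures.add(pool.submit(() -> s.search(query, topK)));
            }
            // Collect per-shard results and keep the best topK overall by score.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : futures) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topK, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two toy shards returning canned hits, just to exercise the merge.
        Shard a = (q, k) -> List.of(new Hit(1, 0.9f), new Hit(2, 0.3f));
        Shard b = (q, k) -> List.of(new Hit(3, 0.7f));
        System.out.println(searchAll(List.of(a, b), "demo", 2));
    }
}
```

As Toke notes, a setup with low maxDoc pays essentially nothing here: with one shard there is one task on one thread, while a heavily sharded collection gets the fan-out for free, CPU cores and RAM permitting.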