Over the past few years, I have seen requests for more than 2B documents
in a single index only on the Solr user mailing list - nobody has asked
for this on the Lucene user list. The point is that this is primarily a
Lucene limitation that just happens to get passed through to Solr users,
but until there is a hue and cry from the non-Solr portion of the Lucene
community, we simply have to wait for the Lucene committers to take
action.
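
The "2B" is a genuine hard limit, not folklore: Lucene document numbers
are Java ints, and IndexWriter enforces a ceiling just shy of
Integer.MAX_VALUE. A trivial check (against the Lucene 5.x API, where
the constant is public):

    import org.apache.lucene.index.IndexWriter;

    public class HardLimit {
        public static void main(String[] args) {
            // Prints 2147483519 (Integer.MAX_VALUE - 128), the most
            // documents a single Lucene index will ever accept.
            System.out.println(IndexWriter.MAX_DOCS);
        }
    }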

Another possibility would be for Solr to support a second level of
sharding - automatically creating a new shard when the number of
documents in a shard exceeds some threshold. After all, Solr already
supports collections larger than 2B docs - they just need to be sharded.
That threshold would certainly have to be below the Lucene hard limit,
but it is probably appropriate to set it much lower, like 500 million or
even my usual 100 million guideline. The theory is that each shard could
be searched in parallel in a separate thread, so that Solr could
actually search a multi-sharded collection faster than the equivalent
single index - provided that you have enough CPU cores and RAM.
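
To make that concrete, here is a rough sketch of what such a watchdog
could look like if you built it yourself today on top of the existing
SPLITSHARD collection API via SolrJ (6.x-style API). To be clear, this
is only an illustration - the collection name, ZooKeeper address, and
threshold are placeholders, and Solr does not do any of this
automatically:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class ShardSplitWatchdog {
        // Well below Lucene's ~2.1B hard limit, per the 100M guideline.
        static final long MAX_DOCS_PER_SHARD = 100_000_000L;

        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181").build()) {
                String collection = "logs";
                for (String shard : client.getZkStateReader().getClusterState()
                        .getCollection(collection).getSlicesMap().keySet()) {
                    // Count the documents in just this logical shard.
                    ModifiableSolrParams p = new ModifiableSolrParams();
                    p.set("q", "*:*");
                    p.set("rows", 0);
                    p.set("shards", shard);
                    long numDocs = client.query(collection, p)
                            .getResults().getNumFound();
                    if (numDocs > MAX_DOCS_PER_SHARD) {
                        // SPLITSHARD halves the shard's hash range; the two
                        // sub-shards take over once they go active.
                        CollectionAdminRequest.splitShard(collection)
                                .setShardName(shard)
                                .process(client);
                    }
                }
            }
        }
    }

Since SPLITSHARD divides the shard's hash range in place, nothing has
to be reindexed from source - though you would still want to move the
resulting sub-shard replicas onto other nodes afterwards.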

Hmmm... I wonder what the Elasticsearch guys might be up to on this front?


-- Jack Krupansky

On Wed, Feb 11, 2015 at 8:05 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: Are there any such structures?
>
> Well, I thought there were, but I've got to admit I can't call any to mind
> immediately.
>
> bq: 2b is just the hard limit
>
> Yeah, I'm always a little nervous as to when Moore's Law will make
> everything I know about current systems' performance obsolete.
>
> At any rate, I _can_ say with certainty that I have no interest at this
> point in exceeding this limit. Of course that may change with
> compelling use-cases ;)....
>
> Best,
> Erick
>
> On Wed, Feb 11, 2015 at 4:14 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
> wrote:
> > Erick Erickson [erickerick...@gmail.com] wrote:
> >
> >> I guess my $0.02 is that you'd have to have strong evidence that extending
> >> Lucene to 64 bit is even useful. Or more generally, useful enough to pay the
> >> penalty. All the structures that allocate maxDoc id arrays would suddenly
> >> require twice the memory for instance,
> >
> > Are there any such structures? It was my impression that ID-structures
> > in Solr were either bitmaps, hashmaps or queues. Anyway, if the number of
> > places with full-size ID-arrays is low, there could be dual implementations
> > selected by maxDoc.
> >
> >> plus all the coding effort that could be spent doing other things.
> >
> > Very true. I agree that at the current stage, > 2b/shard is still a bit
> > too special to spend a lot of effort on it.
> >
> > However, 2b is just the hard limit. As has been discussed before, single
> > shards work best in the lower end of the hundreds of millions of
> > documents. One reason is that many parts of Lucene work single-threaded on
> > structures that scale linearly with document count. Having some hundreds of
> > millions of documents (log analysis being the typical case) is not uncommon
> > these days. A gradual shift to more multi-thread-oriented processing would
> > fit well with current trends in hardware as well as use cases. As opposed
> > to the int->long switch, there would be little to no penalty for setups
> > with low maxDocs (they would just use 1 thread).
> >
> > - Toke Eskildsen
>
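
P.S. On Toke's point about multi-threaded processing: one hook for this
already exists - Lucene's IndexSearcher can be handed an ExecutorService
and will then fan a query out across the index's segments in parallel.
A minimal sketch (Lucene 5.x API; the index path is a placeholder):

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ParallelSegmentSearch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                // With an executor, each segment (or slice of segments)
                // is searched concurrently instead of sequentially.
                IndexSearcher searcher = new IndexSearcher(reader, pool);
                TopDocs top = searcher.search(new MatchAllDocsQuery(), 10);
                System.out.println("hits: " + top.totalHits);
            } finally {
                pool.shutdown();
            }
        }
    }

That only parallelizes the search side, of course - the other
single-threaded, maxDoc-linear parts Toke mentions are a bigger job.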
