Yes, there are some known problems when scaling to a large number of
collections, say 1,000 or more. See
https://issues.apache.org/jira/browse/SOLR-7191

On Sun, Jun 14, 2015 at 8:30 PM, Shai Erera <ser...@gmail.com> wrote:

> Thanks Jack for your response. But I think Arnon's question was different.
>
> If you need to index 10,000 different collections of documents in Solr
> (say a collection denotes someone's Dropbox files), then you have two
> options: index all collections in one Solr collection and add a field
> like collectionID to every document and every query, or index each
> user's private collection as a separate Solr collection.
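
To make the two options concrete, here is a minimal SolrJ sketch of the
first one (one shared collection plus a collectionID filter). The URL,
collection name, other field names, and values are placeholders, and the
single-URL HttpSolrClient constructor is the SolrJ 5.x form -- treat this
as an illustration, not a recommended implementation:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SharedCollectionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection name; adjust for your setup.
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/userdocs");

        // Option (a): tag every document with the owning user's collectionID.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "user42-doc1");
        doc.addField("collectionID", "user42");
        doc.addField("text", "quarterly report draft");
        client.add(doc);
        client.commit();

        // ...and restrict every query with a filter on the same field.
        SolrQuery query = new SolrQuery("report");
        query.addFilterQuery("collectionID:user42");
        long hits = client.query(query).getResults().getNumFound();
        System.out.println("hits for user42: " + hits);

        client.close();
    }
}

With the second option the addFilterQuery line disappears entirely, since
each user only ever queries his own collection.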
>
> The advantage of the latter is that you don't need to add a collectionID
> filter to each query. It also helps from a security/privacy standpoint
> (and for search quality): a user can only ever search what he has access
> to -- e.g. he cannot get a spelling correction for words that never
> appeared in his documents, nor document suggestions (even though the
> 'context' feature in some of Lucene's suggesters allows one to do that
> too). From a quality standpoint, you also don't mix term statistics
> across users.
>
> So from a single node's point of view, you can index 100M documents
> either in one index (collection, shard, replica -- whatever -- a single
> Solr core) or in 10,000 such cores. From a node-capacity perspective the
> two are the same: the same number of documents is indexed overall, the
> query workload is the same, and so on.
>
> So the question is purely about Solr and its collection management -- is
> there anything in that process that can prevent one from managing
> thousands of collections on a single node, or within a single SolrCloud
> cluster? If so, what is it -- is it the ZK watchers? Is there a thread
> per collection at work? Something else?
>
> Shai
>
> On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > As a general rule, there are only two ways in which Solr scales to large
> > numbers: a large number of documents and a moderate number of nodes
> > (shards and replicas). All other parameters should be kept relatively
> > small, like dozens or low hundreds. Even shards and replicas should
> > probably be kept down to that same guidance of dozens or low hundreds.
> >
> > Tens of millions of documents should be no problem. I recommend 100
> > million as a rough limit on documents per node. Of course it all depends
> > on your particular data model, data, hardware, and network, so that
> > number could be smaller or larger.
> >
> > The main guidance has always been to simply do a proof-of-concept
> > implementation to test against your particular data model and data values.
> >
> > -- Jack Krupansky
> >
> > On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev <arn...@il.ibm.com> wrote:
> >
> > > We're running some tests on Solr and would like to have a deeper
> > > understanding of its limitations.
> > >
> > > Specifically, we have tens of millions of documents (say 50M) and are
> > > comparing several "#collections X #docs_per_collection" configurations.
> > > For example, we could have a single collection with 50M docs or 5000
> > > collections with 10K docs each.
> > > When trying to create the 5000 collections, we start getting frequent
> > > errors after 1000-1500 collections have been created. It feels like some
> > > limit has been reached.
> > > These tests are done on a single node plus an additional node for
> > > replicas.
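
For reference, a minimal sketch of the kind of creation loop involved,
assuming the stock Collections API CREATE call over HTTP; the host,
configset name, collection naming, shard count, and replication factor
below are all placeholders:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateManyCollections {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 5000; i++) {
            // Placeholder host, configset name, and counts.
            String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=user_" + i
                + "&numShards=1&replicationFactor=2"
                + "&collection.configName=myconf";
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            int status = conn.getResponseCode();
            if (status == 200) {
                try (InputStream in = conn.getInputStream()) {
                    in.skip(Long.MAX_VALUE); // drain and discard the response
                }
            } else {
                // Per the observation above, failures become frequent
                // somewhere around the 1000-1500 collection mark.
                System.err.println("CREATE failed for user_" + i
                    + ": HTTP " + status);
            }
            conn.disconnect();
        }
    }
}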
> > >
> > > Can someone elaborate on what, if anything, limits Solr at a high number
> > > of collections?
> > > I.e., if we wanted to have 5K or 10K (or 100K) collections, is there
> > > anything in Solr that would prevent it? Where would it break?
> > >
> > > Thanks,
> > > Arnon
> >
>



-- 
Regards,
Shalin Shekhar Mangar.
