Yes, there are some known problems when scaling to a large number of collections, say 1,000 or more. See https://issues.apache.org/jira/browse/SOLR-7191
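For reference, the kind of bulk collection creation described in the test below boils down to repeated Collections API calls. A minimal SolrJ sketch (the ZooKeeper address, config set name, and collection names are made up for illustration, and the exact SolrJ API differs between versions):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateManyCollections {
      public static void main(String[] args) throws Exception {
        // Connect to the SolrCloud cluster via ZooKeeper (address is hypothetical).
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
          for (int i = 0; i < 5000; i++) {
            // One Collections API call per collection: 1 shard, 1 replica,
            // all sharing the same (hypothetical) "sharedConfig" config set.
            CollectionAdminRequest.createCollection("user_" + i, "sharedConfig", 1, 1)
                .process(client);
          }
        }
      }
    }

Each create call goes through the Overseer and adds per-collection state in ZooKeeper plus cores on the nodes, so the overhead grows with every collection created.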
On Sun, Jun 14, 2015 at 8:30 PM, Shai Erera <ser...@gmail.com> wrote:

> Thanks Jack for your response. But I think Arnon's question was different.
>
> If you need to index 10,000 different collections of documents in Solr (say
> a collection denotes someone's Dropbox files), then you have two options:
> index all collections in one Solr collection and add a field like
> collectionID to each document and query, or index each user's private
> collection in a different Solr collection.
>
> The pros of the latter are that you don't need to add a collectionID filter
> to each query. Also, from a security/privacy standpoint (and a search-quality
> one), a user can only ever search what he has access to -- e.g. he cannot get
> a spelling correction for words he never saw in his documents, nor document
> suggestions (even though the 'context' in some of the Lucene suggesters
> allows one to do that too). From a quality standpoint, you don't mix
> different term statistics etc.
>
> So from a single node's point of view, you can either index 100M documents
> in one index (collection, shard, replica -- whatever -- a single Solr core)
> or in 10,000 such cores. From a node-capacity perspective the two are the
> same -- the same number of documents will be indexed overall, the same query
> workload, etc.
>
> So the question is purely about Solr and its collections management -- is
> there anything in that process that can prevent one from managing thousands
> of collections on a single node, or within a single SolrCloud instance? If
> so, what is it -- the ZK watchers? Is there a thread per collection at work?
> Something else?
>
> Shai
>
> On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > As a general rule, there are only two ways that Solr scales to large
> > numbers: a large number of documents and a moderate number of nodes
> > (shards and replicas). All other parameters should be kept relatively
> > small, like dozens or low hundreds. Even shards and replicas should
> > probably be kept down to that same guidance of dozens or low hundreds.
> >
> > Tens of millions of documents should be no problem. I recommend 100
> > million as the rough limit of documents per node. Of course, it all
> > depends on your particular data model, data, hardware, and network, so
> > that number could be smaller or larger.
> >
> > The main guidance has always been to simply do a proof-of-concept
> > implementation to test for your particular data model and data values.
> >
> > -- Jack Krupansky
> >
> > On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev <arn...@il.ibm.com> wrote:
> >
> > > We're running some tests on Solr and would like to have a deeper
> > > understanding of its limitations.
> > >
> > > Specifically, we have tens of millions of documents (say 50M) and are
> > > comparing several "#collections X #docs_per_collection" configurations.
> > > For example, we could have a single collection with 50M docs, or 5,000
> > > collections with 10K docs each.
> > > When trying to create the 5,000 collections, we start getting frequent
> > > errors after 1,000-1,500 collections have been created. It feels like
> > > some limit has been reached.
> > > These tests are done on a single node, plus an additional node for
> > > replicas.
> > >
> > > Can someone elaborate on what could limit Solr to a high number of
> > > collections (if at all)?
> > > I.e., if we wanted to have 5K or 10K (or 100K) collections, is there
> > > anything in Solr that can prevent it? Where would it break?
> > > Thanks,
> > > Arnon

--
Regards,
Shalin Shekhar Mangar.
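As an illustration of Shai's first option above (one shared collection plus a collectionID filter), a query restricted to a single user's documents might look like the following SolrJ sketch. The collection name, field name, user ID, query, and ZooKeeper address are hypothetical:

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PerUserQuery {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
          SolrQuery query = new SolrQuery("subject:report");
          // The filter query restricts results to one user's documents (and is
          // cached separately), but term statistics, spellcheck, and suggestions
          // still come from the whole shared index -- the trade-off Shai describes.
          query.addFilterQuery("collectionID:user_12345");
          QueryResponse rsp = client.query("shared_docs", query);
          System.out.println("hits: " + rsp.getResults().getNumFound());
        }
      }
    }

With one collection per user, the same query would simply target that user's collection, so the filter and the cross-user statistics disappear at the cost of managing many more collections.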