Hi, I'm looking for some input on design considerations for defining collections in a SolrCloud cluster. Right now, our cluster consists of two collections, each in a 2 shard / 2 replica configuration. Each collection has a dedicated set of sources that don't overlap, which made it an easy decision. Recently, we got a requirement to index a bunch of new sources that are region based. The search results for each region need to come from that region's specific source as well as the sources from one of our existing collections. Here's an example of our existing collections and their corresponding source(s).

Existing Collections:
--------------------------
Collection A --> Source_A, Source_B
Collection B --> Source_C, Source_D, Source_E

Proposed Collections:
----------------------------
Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E

The proposed collections show that each geo has its dedicated source as well as the source(s) from existing Collection B. I'm wondering if creating a dedicated collection for each geo is the right approach here. The main motivation is to support a geo-specific relevancy model that can easily be customized without the regions stepping on each other. On the downside, I'm not sure it's a good idea to replicate data from the same sources across multiple collections. Moreover, the data within the sources is not relational, so joining across collections might not be an easy proposition.

The other consideration is the hardware design. Right now, both shards and their replicas run on their own dedicated instances. With two collections, we sometimes run into OOM scenarios, so I'm a little worried about adding more collections. Does the best practice (I know it's subjective) in scenarios like this call for a dedicated Solr cluster per collection?

From an index size perspective, Source_C, Source_D and Source_E combined come to close to 10 million documents with a 60 GB volume size. Each geo-based source is small and won't exceed 500k documents.

Any pointers will be appreciated.

Thanks,
Shamik