Hi,

    I'm looking for some input on design considerations for defining
collections in a SolrCloud cluster. Right now, our cluster consists of two
collections in a 2 shard / 2 replica mode. Each collection has a dedicated
set of sources, and the sets don't overlap, which made it an easy decision.
Recently, we got a requirement to index a bunch of new sources that are
region based. The search results for each region need to come from that
region's specific source as well as the sources from one of our existing
collections. Here's an example of our existing collections and their
corresponding source(s).

Existing Collections:
--------------------------
Collection A --> Source_A, Source_B
Collection B --> Source_C, Source_D, Source_E

Proposed Collections:
----------------------------
Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E

The proposed collections show that each geo has its own dedicated source
as well as the source(s) from existing Collection B.

Just wondering if creating a dedicated collection for each geo is the right
approach here. The main motivation is to support a geo-specific relevancy
model which can easily be customized without the geos stepping on each
other. On the downside, I'm not sure if it's a good idea to replicate data
from the same sources across various collections. Moreover, the data within
the sources is not relational, so joining across collections might not be
an easy proposition.
The other consideration is the hardware design. Right now, both shards and
their replicas each run on a dedicated instance. With two collections, we
sometimes run into OOM scenarios, so I'm a little bit worried about adding
more collections. Does the best practice (I know it's subjective) in
scenarios like this call for a dedicated Solr cluster per collection? From
an index size perspective, Source_C, Source_D and Source_E combined come to
close to 10 million documents with a 60 GB volume size. Each geo-based
source is small and won't exceed 500k documents.
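
To put rough numbers on the duplication concern, here is a
back-of-the-envelope sketch in Python. It only uses the approximate figures
quoted above (10 million documents, 60 GB, 500k per geo). The replication
factor of 2 is an assumption carried over from the current setup, and the
estimate ignores the (small, unmeasured) index size of the geo sources
themselves.

```python
# Approximate figures from the description above; the rest are assumptions.
SHARED_DOCS = 10_000_000   # Source_C + Source_D + Source_E combined
SHARED_SIZE_GB = 60        # index volume of the shared sources
GEO_DOCS_MAX = 500_000     # upper bound for each geo-specific source
GEOS = ["Asia", "Europe", "Australia"]
REPLICATION_FACTOR = 2     # assumption: same as the existing collections


def estimate(geos, replication_factor):
    """Estimate documents and shared-source volume if every geo
    collection holds its own full copy of Source_C/D/E."""
    docs_per_collection = SHARED_DOCS + GEO_DOCS_MAX
    shared_gb = SHARED_SIZE_GB * len(geos)
    return {
        "docs_per_collection": docs_per_collection,
        "total_docs": docs_per_collection * len(geos),
        "shared_gb_single_copy": shared_gb,
        "shared_gb_with_replicas": shared_gb * replication_factor,
    }


summary = estimate(GEOS, REPLICATION_FACTOR)
for name, value in summary.items():
    print(f"{name}: {value:,}")
```

Under those assumptions, the three geo collections would carry the shared
sources three times over, before counting replicas.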

Any pointers will be appreciated.

Thanks,
Shamik
