Hi,

  I'm trying to figure out the best way to design/allocate shards for our
Solr Cloud environment. Our current index has around 20 million documents, in
10 languages. Around 25-30% of the content is in English. The rest is almost
equally distributed among the remaining 13 languages. Until now, we had to
deal with query-time deduplication using the collapsing parser, for which we
used multi-level composite routing. As a result, documents were
disproportionately distributed across the 3 shards. The shard containing the
duplicate data ended up hosting 80% of the index. For example, Shard1 had a
30 GB index while Shard2 and Shard3 had 10 GB each. The composite key is
currently made of "language!dedup_id!url". At query time, we are using
shard.keys=language/8! for three-level routing.

Due to the performance overhead, we decided to move the de-duplication logic
to index time, which makes the composite routing redundant. We are not
discarding the duplicate content, so there's no change in index size. Before
I update the routing key, I wanted to check what the best approach to the
sharding architecture would be so that we get optimal performance.
We currently have 3 shards with 2 replicas each. The entire index resides
in a single collection. What I'm trying to understand is whether we should:

1. We let Solr use simple document routing based on id and route the
documents to any of the 3 shards
2. We create a composite id using language, e.g. language!unique_id, and
make sure that content in the same language will always be in the same shard.
What I'm not sure about is whether the index will be equally distributed
across the three shards.
3. Index English-only content to a dedicated shard, with the rest equally
distributed across the remaining two. I'm not sure if that's possible.
4. Create a dedicated collection for English and one for the rest of the
languages.
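
To make the concern behind option 2 concrete, here is a rough Python sketch
of how uneven the shards could get when every document of a language lands on
the same shard. It is only an illustration under stated assumptions: zlib.crc32
stands in for Solr's real hash function (it is NOT what Solr uses), the
language codes are hypothetical, and the shares are taken as 28% English with
the rest split evenly.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3  # matches the 3 shards described above


def shard_for_prefix(prefix: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard for a routing prefix.

    Assumption: crc32 is only a deterministic stand-in for illustration;
    Solr's compositeId router uses a different hash and hash ranges.
    """
    return zlib.crc32(prefix.encode("utf-8")) % num_shards


def shard_shares(language_shares: dict[str, float],
                 num_shards: int = NUM_SHARDS) -> dict[int, float]:
    """Sum the document share per shard when each language maps to one shard."""
    totals = defaultdict(float)
    for lang, share in language_shares.items():
        totals[shard_for_prefix(lang, num_shards)] += share
    # Include shards that received no language at all.
    return {shard: totals.get(shard, 0.0) for shard in range(num_shards)}


# Hypothetical distribution: English at 28%, the rest split evenly.
other_languages = ["de", "fr", "es", "it", "pt", "ja", "ko", "zh", "ru"]
language_shares = {"en": 0.28}
for lang in other_languages:
    language_shares[lang] = 0.72 / len(other_languages)

shares = shard_shares(language_shares)
for shard, share in sorted(shares.items()):
    print(f"shard{shard + 1}: {share:.0%} of documents")
```

Whatever the hash, the shard that receives English holds at least the English
share, and the remaining languages fall wherever their prefixes happen to
hash, so an even three-way split is not guaranteed.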

Any pointers on this will be highly appreciated.

Regards,
Shamik
