Hi,

I'm trying to figure out the best way to design and allocate shards for our SolrCloud environment. Our current index has around 20 million documents in 10 languages. Around 25-30% of the content is in English; the rest is almost equally distributed among the remaining 13 languages.

Until now, we handled deduplication at query time using the collapsing query parser, for which we used multi-level composite routing. As a result, documents were distributed disproportionately across the 3 shards: the shard containing the duplicate data ended up hosting 80% of the index. For example, Shard1 had a 30 GB index, while Shard2 and Shard3 had 10 GB each. The composite key is currently made of "language!dedup_id!url". At query time, we use shard.keys=language/8! for three-level routing.
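As a rough illustration of the routing scheme described above, here is a minimal Python sketch of how the three-level key and the query-time routing parameter are put together. The helper names (composite_id, route_params) and the sample values are hypothetical and only there for illustration; the key layout and the shard.keys value are the ones mentioned above.

```python
from urllib.parse import urlencode


def composite_id(language, dedup_id, url):
    """Build a three-level routing key of the form language!dedup_id!url."""
    # Assumption: none of the parts contains '!', which Solr treats as the separator.
    return "!".join([language, dedup_id, url])


def route_params(language, bits=8, query="*:*"):
    """Build query parameters that restrict the search to one language prefix."""
    # '/8' asks the router to take 8 bits of the hash from the language part.
    return {"q": query, "shard.keys": f"{language}/{bits}!"}


# Hypothetical sample values, not real data from the index.
doc_id = composite_id("en", "dedup42", "http://example.com/page")
params = route_params("en")

print(doc_id)
print(urlencode(params))
```

The sketch only builds strings; it does not talk to Solr, and the actual shard a document lands on is decided by Solr's own hashing of the key.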
Due to the performance overhead, we decided to move the de-duplication logic to index time, which made the composite routing redundant. We are not discarding the duplicate content, so there's no change in index size.

Before I update the routing key, I wanted to check what the best approach to the sharding architecture would be so that we get optimal performance. We currently have 3 shards with 2 replicas each. The entire index resides in a single collection. What I'm trying to understand is whether we should:

1. Let Solr use simple document routing based on id and route the documents to any of the 3 shards.
2. Create a composite id using language, e.g. language!unique_id, and make sure that content in the same language always lands in the same shard. What I'm not sure about is whether the index will be equally distributed across the three shards.
3. Index English-only content to a dedicated shard, with the rest equally distributed across the remaining two. I'm not sure if that's possible.
4. Create a dedicated collection for English and one for the rest of the languages.

Any pointers on this will be highly appreciated.

Regards,
Shamik