Re: Solr Cloud sharding strategy
What do you mean "the rest of the cluster"? The routing is based on the key provided. All of the "enu" prefixes will go to one of your shards. All the "deu" docs will appear on one shard. All the "esp" docs will be on one shard. All the "chs" docs will be on one shard.

Which shard will each go to? Good question. Especially when you have a small number of keys and/or one key accounts for a majority of your corpus, you can end up with very uneven distributions.

If you require individual control, what I'd do is create separate _collections_ for each language, then use collection aliasing to have a single URL to query. Of course, that requires that you index to the correct collection. You could also create a collection for the language with the most docs and one for "everything else". The advantage here is that each collection can be tailored to its number of docs. That is, the Spanish collection may be a single shard whereas the English one may be 4 shards.

But really, with a corpus this size I wouldn't worry about it. I suspect you're over-thinking the problem.

And one addendum to Walter's comment... I often turn caching off (or way down) when doing perf testing if I can't mine logs for, say, 100K queries, in an attempt to negate the effects of caching. That doesn't force swapping, though, which is its weakness. I worked with one client that was thrilled at getting < 5ms response times for their stress tests with many threads simultaneously executing queries, except they were firing the exact same query over and over and over.

Best,
Erick

On Mon, Mar 7, 2016 at 7:36 PM, shamik wrote:
> Thanks Erick and Walter, this is extremely insightful. One last followup
> question on composite routing. I'm trying to have a better understanding of
> index distribution. If I use language as a prefix, SolrCloud guarantees that
> same-language content will be routed to the same shard. What I'm curious to
> know is how the rest of the data is distributed across the remaining shards.
> For example, I have the following composite keys:
>
> enu!doc1
> enu!doc2
> deu!doc3
> deu!doc4
> esp!doc5
> chs!doc6
>
> If I have 2 shards in the cluster, will SolrCloud try to distribute the
> above data evenly? Is it possible that enu will be routed to shard1 while
> deu goes to shard2, and esp and chs get indexed in either of them? Or can
> all of them potentially end up getting indexed in the same shard, either 1
> or 2, leaving one shard under-utilized?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262336.html
> Sent from the Solr - User mailing list archive at Nabble.com.
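[Editor's note: to make the routing mechanics above concrete, here is a minimal Python sketch of compositeId-style routing. Solr actually uses MurmurHash3 and its own shard range bookkeeping; the hash below is a stand-in built from hashlib purely to illustrate the mechanism: the top 16 bits of the 32-bit hash come from the prefix before "!", so every doc sharing a prefix falls in the same shard's slice of the hash ring.]

```python
import hashlib

def h32(s):
    # Stand-in 32-bit hash; real Solr uses MurmurHash3, not MD5.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def composite_hash(doc_id):
    # Top 16 bits from the routing prefix, bottom 16 from the rest of the id.
    prefix, _, rest = doc_id.partition("!")
    return (h32(prefix) & 0xFFFF0000) | (h32(rest) & 0x0000FFFF)

def shard_for(doc_id, num_shards=2):
    # Each shard owns an equal, contiguous slice of the 32-bit hash range.
    return composite_hash(doc_id) // ((1 << 32) // num_shards)

for d in ["enu!doc1", "enu!doc2", "deu!doc3", "deu!doc4", "esp!doc5", "chs!doc6"]:
    print(d, "-> shard", shard_for(d))
```

Because the prefix alone fixes the top 16 bits, docs sharing a prefix always land together; which shard a given prefix hashes to, and how evenly the prefixes spread, is up to the hash, which is exactly the unevenness Erick describes with a small number of keys.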
Re: Solr Cloud sharding strategy
Thanks Erick and Walter, this is extremely insightful. One last followup question on composite routing. I'm trying to have a better understanding of index distribution. If I use language as a prefix, SolrCloud guarantees that same-language content will be routed to the same shard. What I'm curious to know is how the rest of the data is distributed across the remaining shards. For example, I have the following composite keys:

enu!doc1
enu!doc2
deu!doc3
deu!doc4
esp!doc5
chs!doc6

If I have 2 shards in the cluster, will SolrCloud try to distribute the above data evenly? Is it possible that enu will be routed to shard1 while deu goes to shard2, and esp and chs get indexed in either of them? Or can all of them potentially end up getting indexed in the same shard, either 1 or 2, leaving one shard under-utilized?

--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262336.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Cloud sharding strategy
Excellent advice, and I’d like to reinforce a few things.

* Solr indexing is CPU intensive and generates lots of disk IO. Faster CPUs and faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile latency, and that is dominated by rare and malformed queries.
* 5000 queries is not nearly enough. That totally fits in cache. I usually start with 100K, though I’d like more. Benchmarking a cached system is one of the hardest things in devops.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Mar 7, 2016, at 4:27 PM, Erick Erickson wrote:
>
> Still, 50M is not excessive for a single shard, although it's getting
> into the range where I'd like proof that my hardware etc. is adequate
> before committing to it. I've seen up to 300M docs on a single
> machine, admittedly they were tweets. YMMV based on hardware and index
> complexity, of course. Here's a long blog about sizing:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> In this case I'd be pretty comfortable creating a test harness
> (using jMeter or the like) and faking the extra 30M documents by
> re-indexing the current corpus but assigning new IDs.
> Keep doing this until your target machine breaks (i.e. either blows up
> by exhausting memory or the response slows unacceptably) and that'll
> give you a good upper bound. Note that you should plan on a couple of
> rounds of tuning/testing when you start to have problems.
>
> I'll warn you up front, though, that unless you have an existing app
> to mine for _real_ user queries, generating say 5,000 "typical"
> queries is more of a challenge than you might expect ;)...
>
> Now, all that said, all is not lost if you do go with a single shard.
> Let's say that 6 months down the road your requirements change. Or the
> initial estimate was off. There are a couple of options:
>
> 1> create a new collection with more shards and re-index from scratch
> 2> use the SPLITSHARD Collections API call to, well, split the shard.
>
> In the latter case, a shard is split into two pieces of roughly equal
> size, which does mean that you can only grow your shard count by
> powers of 2.
>
> And even if you do have a single shard, using SolrCloud is still a
> good thing as the failover is automagically handled, assuming you have
> more than one replica...
>
> Best,
> Erick
>
> On Mon, Mar 7, 2016 at 4:05 PM, shamik wrote:
>> Thanks a lot, Erick. You are right, it's a tad small with around 20 million
>> documents, but the growth projection is around 50 million in the next 6-8
>> months. It'll continue to grow, but maybe not at the same rate. From the
>> index size point of view, the size can grow up to half a TB from its
>> current state. Honestly, my perception of a "big" index is still vague :-).
>> All I'm trying to make sure is that the decision I take is scalable in the
>> long term and will be able to sustain the growth without compromising
>> performance.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
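[Editor's note: Erick's point that SPLITSHARD only grows shard counts by powers of 2 can be sketched numerically. This is not Solr code, just a Python model of hash-range splitting with names of my own choosing: each split replaces one shard's range with two equal halves, so keeping every shard the same size means splitting all of them, doubling the count each round.]

```python
def split_shard(ranges, idx):
    # SPLITSHARD-style: replace one shard's hash range with two equal halves.
    lo, hi = ranges[idx]
    mid = lo + (hi - lo) // 2
    return ranges[:idx] + [(lo, mid), (mid + 1, hi)] + ranges[idx + 1:]

def split_all(ranges):
    # To keep every shard the same size you must split each one,
    # which is why balanced growth goes 1 -> 2 -> 4 -> 8 shards.
    out = []
    for r in range(len(ranges)):
        pass
    out = []
    for lo, hi in ranges:
        mid = lo + (hi - lo) // 2
        out += [(lo, mid), (mid + 1, hi)]
    return out

ring = [(0, 2**32 - 1)]   # one shard owns the whole hash ring
ring = split_all(ring)    # 2 shards
ring = split_all(ring)    # 4 shards, each owning a quarter of the ring
print(len(ring), ring[0])
```

Splitting just one shard (option 2 applied selectively, via `split_shard`) leaves you with unequal shard counts per range, which is fine when one shard is hot, but even growth follows the doubling pattern.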
Re: Solr Cloud sharding strategy
Still, 50M is not excessive for a single shard, although it's getting into the range where I'd like proof that my hardware etc. is adequate before committing to it. I've seen up to 300M docs on a single machine, admittedly they were tweets. YMMV based on hardware and index complexity, of course. Here's a long blog about sizing:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

In this case I'd be pretty comfortable creating a test harness (using jMeter or the like) and faking the extra 30M documents by re-indexing the current corpus but assigning new IDs. Keep doing this until your target machine breaks (i.e. either blows up by exhausting memory or the response slows unacceptably) and that'll give you a good upper bound. Note that you should plan on a couple of rounds of tuning/testing when you start to have problems.

I'll warn you up front, though, that unless you have an existing app to mine for _real_ user queries, generating say 5,000 "typical" queries is more of a challenge than you might expect ;)...

Now, all that said, all is not lost if you do go with a single shard. Let's say that 6 months down the road your requirements change. Or the initial estimate was off. There are a couple of options:

1> create a new collection with more shards and re-index from scratch
2> use the SPLITSHARD Collections API call to, well, split the shard.

In the latter case, a shard is split into two pieces of roughly equal size, which does mean that you can only grow your shard count by powers of 2.

And even if you do have a single shard, using SolrCloud is still a good thing as the failover is automagically handled, assuming you have more than one replica...

Best,
Erick

On Mon, Mar 7, 2016 at 4:05 PM, shamik wrote:
> Thanks a lot, Erick. You are right, it's a tad small with around 20 million
> documents, but the growth projection is around 50 million in the next 6-8
> months. It'll continue to grow, but maybe not at the same rate. From the
> index size point of view, the size can grow up to half a TB from its
> current state. Honestly, my perception of a "big" index is still vague :-).
> All I'm trying to make sure is that the decision I take is scalable in the
> long term and will be able to sustain the growth without compromising
> performance.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
> Sent from the Solr - User mailing list archive at Nabble.com.
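[Editor's note: Erick's trick of faking the extra 30M documents by re-indexing the corpus under new IDs can be sketched as below. This is a hypothetical helper, not SolrJ or jMeter code: each pass clones every document with a suffixed unique id, so a 20M corpus becomes 40M, then 60M, and so on, before being fed back to the indexer.]

```python
def inflate(docs, passes=1):
    # Clone the corpus, giving each copy a new unique id so Solr would
    # index it as a distinct document (the "assign new IDs" trick).
    out = list(docs)
    for n in range(1, passes + 1):
        out += [{**d, "id": f"{d['id']}_copy{n}"} for d in docs]
    return out

corpus = [{"id": "doc1", "lang": "enu"}, {"id": "doc2", "lang": "deu"}]
bigger = inflate(corpus, passes=2)
print(len(bigger))  # 2 originals + 2 cloned passes of 2 docs = 6
```

In a real test you would stream the clones to Solr between measurement rounds rather than hold them in memory, repeating until the target machine degrades, which gives the upper bound Erick describes.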
Re: Solr Cloud sharding strategy
Thanks a lot, Erick. You are right, it's a tad small with around 20 million documents, but the growth projection is around 50 million in the next 6-8 months. It'll continue to grow, but maybe not at the same rate. From the index size point of view, the size can grow up to half a TB from its current state. Honestly, my perception of a "big" index is still vague :-). All I'm trying to make sure is that the decision I take is scalable in the long term and will be able to sustain the growth without compromising performance.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Cloud sharding strategy
20M docs is actually a very small collection by the "usual" Solr standards, unless they're _really_ large documents, i.e. large books. Actually, I wouldn't even shard to begin with; it's unlikely that it's necessary, and it adds inevitable overhead. If you _must_ shard, just go with <1>, but again I would be surprised if it was even necessary.

Best,
Erick

On Mon, Mar 7, 2016 at 2:35 PM, Shamik Bandopadhyay wrote:
> Hi,
>
> I'm trying to figure out the best way to design/allocate shards for our
> Solr Cloud environment. Our current index has around 20 million documents,
> in 10 languages. Around 25-30% of the content is in English. The rest is
> almost equally distributed among the remaining 13 languages. Till now, we
> had to deal with query-time deduplication using the collapsing parser, for
> which we used multi-level composite routing. But due to that, documents
> were disproportionately distributed across 3 shards. The shard containing
> the duplicate data ended up hosting 80% of the index. For example, Shard1
> had a 30gb index while Shard2 and Shard3 had 10gb each. The composite key
> is currently made of "language!dedup_id!url". At query time, we are using
> shard.keys=language/8! for three-level routing.
>
> Due to performance overhead, we decided to move the de-duplication logic
> to index time, which made the composite routing redundant. We are not
> discarding the duplicate content, so there's no change in index size.
> Before I update the routing key, I just wanted to check what would be the
> best approach to the sharding architecture so that we get optimal
> performance. We currently have 3 shards with 2 replicas each. The entire
> index resides in one single collection. What I'm trying to understand is
> whether:
>
> 1. We let Solr use simple document routing based on id and route the
> documents to any of the 3 shards.
> 2. We create a composite id using language, e.g. language!unique_id, and
> make sure that the same language content will always be in the same shard.
> What I'm not sure of is whether the index will be equally distributed
> across the three shards.
> 3. Index English-only content to a dedicated shard, with the rest equally
> distributed to the remaining two. I'm not sure if that's possible.
> 4. Create a dedicated collection for English and one for the rest of the
> languages.
>
> Any pointers on this will be highly appreciated.
>
> Regards,
> Shamik
Solr Cloud sharding strategy
Hi,

I'm trying to figure out the best way to design/allocate shards for our Solr Cloud environment. Our current index has around 20 million documents, in 10 languages. Around 25-30% of the content is in English. The rest is almost equally distributed among the remaining 13 languages. Till now, we had to deal with query-time deduplication using the collapsing parser, for which we used multi-level composite routing. But due to that, documents were disproportionately distributed across 3 shards. The shard containing the duplicate data ended up hosting 80% of the index. For example, Shard1 had a 30gb index while Shard2 and Shard3 had 10gb each. The composite key is currently made of "language!dedup_id!url". At query time, we are using shard.keys=language/8! for three-level routing.

Due to performance overhead, we decided to move the de-duplication logic to index time, which made the composite routing redundant. We are not discarding the duplicate content, so there's no change in index size. Before I update the routing key, I just wanted to check what would be the best approach to the sharding architecture so that we get optimal performance. We currently have 3 shards with 2 replicas each. The entire index resides in one single collection. What I'm trying to understand is whether:

1. We let Solr use simple document routing based on id and route the documents to any of the 3 shards.
2. We create a composite id using language, e.g. language!unique_id, and make sure that the same language content will always be in the same shard. What I'm not sure of is whether the index will be equally distributed across the three shards.
3. Index English-only content to a dedicated shard, with the rest equally distributed to the remaining two. I'm not sure if that's possible.
4. Create a dedicated collection for English and one for the rest of the languages.

Any pointers on this will be highly appreciated.

Regards,
Shamik
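[Editor's note: option 2's balance question can be modeled roughly. The sketch below uses illustrative numbers of my own (~28% English, the remainder split evenly across 13 languages) and greedily places each language on the least-loaded of 3 shards. Solr's compositeId hashing does NOT do this packing; this shows the best case a manual placement scheme, such as per-language collections or implicit routing, could achieve. With many similarly-sized languages the packing comes out fairly even; skew appears when one key dominates.]

```python
# Hypothetical corpus fractions: ~28% English, rest split across 13 languages.
fractions = {"enu": 0.28}
for lang in ["deu", "esp", "chs", "fra", "ita", "jpn", "kor",
             "ptb", "rus", "plk", "trk", "csy", "hun"]:
    fractions[lang] = 0.72 / 13

# Greedy bin packing: put each language (largest first) on the
# least-loaded shard, since a language can't be split across shards.
shards = [0.0, 0.0, 0.0]
for lang, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    i = shards.index(min(shards))
    shards[i] += frac

print([round(s, 3) for s in shards])
```

The whole corpus is accounted for and, with these particular fractions, the three shards end up within about a percentage point of each other; the result degrades quickly if one language's share grows much past 1/3.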