Excellent advice, and I’d like to reinforce a few things.

* Solr indexing is CPU-intensive and generates a lot of disk IO. Faster CPUs and faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile latency, and that is dominated by rare and malformed queries.
* 5,000 queries is not nearly enough; that totally fits in cache. I usually start with 100K, though I’d like more. Benchmarking a cached system is one of the hardest things in devops.
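For concreteness, here's a rough sketch of the kind of log replay I mean, in Python. It assumes one raw query string per line in queries.txt and a core at localhost:8983/solr/collection1 (both placeholders); treat it as a sketch, not a finished tool.

    # replay_queries.py -- crude single-threaded log replay, reports p95 latency.
    # Assumes queries.txt holds one raw query string per line.
    import time
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/collection1/select"  # placeholder core

    latencies = []
    with open("queries.txt") as f:
        for line in f:
            q = line.strip()
            if not q:
                continue
            url = SOLR + "?" + urllib.parse.urlencode({"q": q, "rows": 10, "wt": "json"})
            start = time.monotonic()
            try:
                urllib.request.urlopen(url, timeout=30).read()
            except Exception:
                pass  # malformed queries still count; they dominate the tail
            latencies.append(time.monotonic() - start)

    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95)]
    print("queries: %d  p95: %.0f ms" % (len(latencies), p95 * 1000))

Single-threaded replay understates concurrent load, so run the real test from jMeter or several workers, but even this will show how much going from 5K to 100K distinct queries changes the cache hit rate.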
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 7, 2016, at 4:27 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Still, 50M is not excessive for a single shard, although it's getting
> into the range where I'd want proof that my hardware etc. is adequate
> before committing to it. I've seen up to 300M docs on a single
> machine, though admittedly they were tweets. YMMV based on hardware
> and index complexity, of course. Here's a long blog about sizing:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> In this case I'd get comfortable by creating a test harness
> (using jMeter or the like) and faking the extra 30M documents by
> re-indexing the current corpus but assigning new IDs (<uniqueKey>).
> Keep doing this until your target machine breaks (i.e. it either
> blows up by exhausting memory or response times slow unacceptably),
> and that'll give you a good upper bound. Note that you should plan on
> a couple of rounds of tuning/testing when you start to have problems.
>
> I'll warn you up front, though, that unless you have an existing app
> to mine for _real_ user queries, generating, say, 5,000 "typical"
> queries is more of a challenge than you might expect ;)...
>
> Now, all that said, all is not lost if you do go with a single shard.
> Let's say that 6 months down the road your requirements change. Or the
> initial estimate was off. Or....
>
> There are a couple of options:
> 1> create a new collection with more shards and re-index from scratch
> 2> use the SPLITSHARD Collections API call to, well, split the shard.
>
> In the latter case, a shard is split into two pieces of roughly equal
> size, which does mean that you can only grow your shard count by
> powers of 2.
>
> And even if you do have a single shard, using SolrCloud is still a
> good thing, as failover is handled automagically, assuming you have
> more than one replica...
>
> Best,
> Erick
>
> On Mon, Mar 7, 2016 at 4:05 PM, shamik <sham...@gmail.com> wrote:
>> Thanks a lot, Erick. You are right, it's a tad small at around 20
>> million documents, but the growth projection is around 50 million in
>> the next 6-8 months. It'll continue to grow, though maybe not at the
>> same rate. In terms of index size, it can grow to half a TB from its
>> current state. Honestly, my perception of a "big" index is still
>> vague :-). All I'm trying to make sure is that the decision I take is
>> scalable in the long term and will sustain the growth without
>> compromising performance.
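P.S. To make the fake-the-extra-30M trick concrete, here's an untested sketch in Python. The collection name and the "id" <uniqueKey> are assumptions, and if you have stored copyField targets you'll need to pop those as well before re-posting.

    # clone_docs.py -- re-index the existing corpus under new IDs to inflate it.
    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/collection1"  # placeholder collection
    PASS = 1  # each pass adds one full copy of the corpus; bump and rerun

    cursor = "*"
    while True:
        params = urllib.parse.urlencode({
            "q": "*:*", "rows": 1000, "sort": "id asc",
            "cursorMark": cursor, "wt": "json",
        })
        resp = json.loads(urllib.request.urlopen(SOLR + "/select?" + params).read())
        docs = resp["response"]["docs"]
        for d in docs:
            d.pop("_version_", None)                 # let Solr assign a new version
            d["id"] = "%s-copy%d" % (d["id"], PASS)  # fresh uniqueKey
        if docs:
            req = urllib.request.Request(SOLR + "/update",
                                         data=json.dumps(docs).encode("utf-8"),
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
        if resp["nextCursorMark"] == cursor:
            break  # cursor stopped advancing: corpus exhausted
        cursor = resp["nextCursorMark"]

    urllib.request.urlopen(SOLR + "/update?commit=true")  # one hard commit at the end

Keep bumping PASS and rerunning until the machine hits the wall, per Erick's suggestion.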
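P.P.S. On SPLITSHARD: splitting a big shard can run for a long while, so the async form plus REQUESTSTATUS polling is the safer pattern. A minimal sketch, again with placeholder collection/shard names:

    # split_shard.py -- kick off an async SPLITSHARD and poll until it finishes.
    import json
    import time
    import urllib.request

    BASE = "http://localhost:8983/solr/admin/collections"  # any node in the cluster

    # async= returns immediately and tags the request so we can poll for it
    urllib.request.urlopen(BASE + "?action=SPLITSHARD&collection=collection1"
                           "&shard=shard1&async=split-1&wt=json")

    while True:
        r = json.loads(urllib.request.urlopen(
            BASE + "?action=REQUESTSTATUS&requestid=split-1&wt=json").read())
        state = r["status"]["state"]
        if state in ("completed", "failed"):
            print("SPLITSHARD", state)
            break
        time.sleep(10)

After a successful split the parent shard stays around in an inactive state, so remember to clean it up with DELETESHARD once you've verified the two children.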