Excellent advice, and I’d like to reinforce a few things.

* Solr indexing is CPU intensive and generates lots of disk IO. Faster CPUs and 
faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile 
latency and that is dominated by rare and malformed queries.
* 5000 queries is not nearly enough; that totally fits in cache. I usually 
start with 100K, though I'd like more (see the replay sketch below). 
Benchmarking a fully cached system is one of the hardest things in devops.
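
If it helps, here is the sort of throwaway replay script I mean. Pure sketch, 
untested: it assumes one raw user query per line in a file called queries.txt 
and a collection named "mycollection", so adjust to whatever you actually have.

  import time
  import requests  # third-party: pip install requests

  SOLR = "http://localhost:8983/solr/mycollection/select"  # adjust host/collection

  latencies = []
  with open("queries.txt") as f:        # one raw user query per line
      for line in f:
          q = line.strip()
          if not q:
              continue
          start = time.time()
          requests.get(SOLR, params={"q": q, "rows": 10})
          latencies.append(time.time() - start)

  latencies.sort()
  p95 = latencies[int(len(latencies) * 0.95)]
  print("%d queries, p95 = %.0f ms" % (len(latencies), p95 * 1000))

It is single-threaded, so it only tells you about latency at trivial load. 
Once you care about throughput, drive the same log from JMeter or similar 
with many concurrent threads.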

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 7, 2016, at 4:27 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Still, 50M is not excessive for a single shard although it's getting
> into the range that I'd like proof that my hardware etc. is adequate
> before committing to it. I've seen up to 300M docs on a single
> machine, admittedly they were tweets. YMMV based on hardware and index
> complexity of course. Here's a long blog about sizing:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> In this case I'd be pretty comfortable by creating a test harness
> (using JMeter or the like) and faking the extra 30M documents by
> re-indexing the current corpus but assigning new IDs (<uniqueKey>).
> Keep doing this until your target machine breaks (i.e. either blows up
> by exhausting memory or the response slows unacceptably) and that'll
> give you a good upper bound. Note that you should plan on a couple of
> rounds of tuning/testing when you start to have problems.
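> 
> Something like this is what I have in mind for the fake-document part. Very
> rough sketch, untested: it assumes the <uniqueKey> field is "id" and a
> collection named "mycollection", and that nothing soft-commits mid-run (so
> the copies don't show up in the cursor walk). You may also need to drop
> stored copyField targets before re-adding or they'll get doubled up.
> 
>   import uuid
>   import requests  # third-party: pip install requests
> 
>   SELECT = "http://localhost:8983/solr/mycollection/select"
>   UPDATE = "http://localhost:8983/solr/mycollection/update"
> 
>   cursor = "*"
>   while True:
>       resp = requests.get(SELECT, params={
>           "q": "*:*", "rows": 1000,
>           "sort": "id asc",            # cursorMark requires a sort on the uniqueKey
>           "cursorMark": cursor}).json()
>       docs = resp["response"]["docs"]
>       for doc in docs:
>           doc.pop("_version_", None)       # let Solr assign a fresh version
>           doc["id"] = str(uuid.uuid4())    # new uniqueKey -> brand-new document
>       if docs:
>           requests.post(UPDATE, json=docs) # JSON array of docs = add
>       next_cursor = resp["nextCursorMark"]
>       if next_cursor == cursor:            # cursor stopped advancing, we're done
>           break
>       cursor = next_cursor
> 
>   requests.get(UPDATE, params={"commit": "true"})
> 
> Each full pass roughly doubles the corpus, so run it until you hit the
> numbers (or the wall) you're looking for.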
> 
> I'll warn you up front, though, that unless you have an existing app
> to mine for _real_ user queries, generating say 5,000 "typical"
> queries is more of a challenge than you might expect ;)...
> 
> Now, all that said, all is not lost if you do go with a single shard.
> Let's say that 6 months down the road your requirements change. Or the
> initial estimate was off. Or....
> 
> There are a couple of options:
> 1> create a new collection with more shards and re-index from scratch
> 2> use the SPLITSHARD Collections API call to, well, split the shard.
> 
> 
> In this latter case, a shard is split into two pieces of roughly equal
> size, which does mean that you can only grow your shard count by
> powers of 2.
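> 
> For example (placeholder collection/shard names, adjust to yours):
> 
> http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1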
> 
> And even if you do have a single shard, using SolrCloud is still a
> good thing as the failover is automagically handled assuming you have
> more than one replica...
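> 
> i.e. create it with something like (again, placeholder names):
> 
> http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2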
> 
> Best,
> Erick
> 
> On Mon, Mar 7, 2016 at 4:05 PM, shamik <sham...@gmail.com> wrote:
>> Thanks a lot, Erick. You are right, it's a tad small with around 20 million
>> documents, but the growth projection is around 50 million in the next 6-8
>> months. It'll continue to grow after that, but maybe not at the same rate.
>> From the index size point of view, it could grow to half a TB from its
>> current state. Honestly, my perception of a "big" index is still vague :-) .
>> All I'm trying to make sure is that the decision I take is scalable in the
>> long term and can sustain the growth without compromising performance.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
