On 8/2/2011 12:06 PM, Jonathan Rochkind wrote:
What's the reasoning behind having three shards on one machine, instead of just combining those into one shard? Just curious. I had been thinking the point of shards was to get them on different machines, and there'd be no reason to have multiple shards on one machine.

I'd be interested in hearing Tom's answer as well, but my answer boils down to the time it takes to do a full index rebuild and worries about performance.

Because I'm in a virtualized environment, I effectively have three large shards on each machine even though they are logically separate. When I first got involved, we had a distributed EasyAsk index on 20 separate low-end physical servers. That evolved into basically the same solution with a smaller number of virtual machines, on a pair of very powerful physical hosts. On this system, doing a full rebuild took nearly two days and wasn't an atomic operation. The EasyAsk system (also based on Lucene) was unable to deal with more than about 4 million documents per machine (real or virtual). The only way to get acceptable performance was distributed search. The cost of providing redundancy was too high, so we didn't have any.

When we first started implementing Solr, we assumed from our previous experience that we'd need distributed search, especially if query volume were to go up. For that reason, we continued our virtualization model, but with only seven shards - six large "static" shards and a smaller "incremental" shard to hold data less than a week old. This is where we are now, and performance is MUCH better than the old solution. The low shard count made redundancy affordable, so we now have that too.
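For anyone who hasn't tried distributed search: at query time you send the request to one core and list all the shard endpoints in the shards parameter, and Solr fans the query out and merges the results. A minimal sketch in Python, with made-up host and core names rather than our real topology:

    import urllib.parse
    import urllib.request

    # Hypothetical endpoints: six static shards plus the incremental shard.
    shards = ",".join(
        [f"idx{n}.example.edu:8983/solr/static{n}" for n in range(1, 7)]
        + ["idx7.example.edu:8983/solr/incremental"]
    )

    params = urllib.parse.urlencode({
        "q": "title:whatever",   # whatever the user searched for
        "shards": shards,        # Solr queries every shard and merges the results
        "rows": 10,
        "wt": "json",
    })
    url = f"http://idx1.example.edu:8983/solr/static1/select?{params}"
    print(urllib.request.urlopen(url).read()[:500])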

At the time Solr was first implemented, we could rebuild the entire index in about two hours and swap it into place all at once. Our index has grown enough since then that it takes a little less than three hours, which is still pretty quick for 60 million documents.
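The swap is possible because Solr's CoreAdmin API has a SWAP action - rebuild into a separate "build" core, then exchange it with the live core in one call. A rough sketch with hypothetical host and core names (this isn't literally my build script):

    import urllib.parse
    import urllib.request

    def swap_cores(host, live_core, build_core):
        # Atomically exchange a freshly rebuilt core with the live one.
        params = urllib.parse.urlencode(
            {"action": "SWAP", "core": live_core, "other": build_core}
        )
        return urllib.request.urlopen(
            f"http://{host}:8983/solr/admin/cores?{params}"
        ).read()

    # Hypothetical: once all the rebuilds finish, swap each shard into place.
    for n in range(1, 7):
        swap_cores(f"idx{n}.example.edu", f"static{n}", f"static{n}_build")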

I did run some early tests with a single large index. Performance was pretty decent once it was warmed up, but I was worried about how it would hold up under heavy load and how it would cope with frequent updates. I never got very far testing those concerns, because the full rebuild time alone was unacceptable - at least 8 hours. The source database can keep up with six DIH (DataImportHandler) instances reindexing at once, and that parallel rebuild finishes much more quickly than a single instance pulling the entire database would. I may increase the number of shards after I remove virtualization, but first I'll need to fix a few limitations in my build system.
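The parallel rebuild itself is nothing exotic: each shard's build core has its own DIH configuration pulling only its slice of the database, and the build script kicks them all off at once and polls until they go idle. A rough sketch (URLs, core names, and the sharding rule are hypothetical):

    import time
    import urllib.request

    # Hypothetical build cores, one per static shard.
    cores = [f"http://idx{n}.example.edu:8983/solr/static{n}_build"
             for n in range(1, 7)]

    # Start a full-import on every shard at once.  Each core's DIH config
    # selects only its slice of the source database, e.g. WHERE MOD(id, 6) = n.
    for core in cores:
        urllib.request.urlopen(core + "/dataimport?command=full-import&clean=true")

    # Poll until every import reports an idle status.
    busy = set(cores)
    while busy:
        time.sleep(30)
        for core in list(busy):
            status = urllib.request.urlopen(core + "/dataimport?command=status").read()
            if b"idle" in status:
                busy.discard(core)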

Thanks,
Shawn
