On 8/2/2011 12:06 PM, Jonathan Rochkind wrote:
What's the reasoning behind having three shards on one machine, instead of just combining those into one shard? Just curious. I had been thinking the point of shards was to get them on different machines, and there'd be no reason to have multiple shards on one machine.

I'd be interested in hearing Tom's answer as well, but my answer boils down to the time it takes to do a full index rebuild and worries about performance.

Because I'm in a virtualized environment, I effectively have three large shards on each machine even though they are logically separate. When I first got involved, we had a distributed EasyAsk index on 20 separate low-end physical servers. That evolved into basically the same solution with a smaller number of virtual machines, on a pair of very powerful physical hosts. On this system, doing a full rebuild took nearly two days and wasn't an atomic operation. The EasyAsk system (also based on Lucene) was unable to deal with more than about 4 million documents per machine (real or virtual). The only way to get acceptable performance was distributed search. The cost of providing redundancy was too high, so we didn't have any.

When we first started implementing Solr, we assumed from our previous experience that we'd need distributed search, especially if query volume were to go up. For that reason, we continued our virtualization model, but with only seven shards - six large "static" shards and a smaller "incremental" shard to hold data less than a week old. This is where we are now, and performance is MUCH better than the old solution. The low shard count made redundancy affordable, so we now have that too.
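For anyone who hasn't tried distributed search: at query time you send the request to one core and list all the shard endpoints in the shards parameter, and Solr fans the query out and merges the results. A minimal sketch in Python, with made-up host and core names rather than our real topology:

    import urllib.parse
    import urllib.request

    # Hypothetical endpoints: six static shards plus the incremental shard.
    shards = ",".join(
        [f"idx{n}.example.edu:8983/solr/static{n}" for n in range(1, 7)]
        + ["idx7.example.edu:8983/solr/incremental"]
    )

    params = urllib.parse.urlencode({
        "q": "title:whatever",   # whatever the user searched for
        "shards": shards,        # Solr queries every shard and merges the results
        "rows": 10,
        "wt": "json",
    })
    url = f"http://idx1.example.edu:8983/solr/static1/select?{params}"
    print(urllib.request.urlopen(url).read()[:500])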

At the time Solr was first implemented, we could rebuild the entire index in about two hours and swap it into place all at once. Our index has grown enough since then that it takes a little less than three hours, which is still pretty quick for 60 million documents.
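The swap is possible because Solr's CoreAdmin API has a SWAP action - rebuild into a separate "build" core, then exchange it with the live core in one call. A rough sketch with hypothetical host and core names (this isn't literally my build script):

    import urllib.parse
    import urllib.request

    def swap_cores(host, live_core, build_core):
        # Atomically exchange a freshly rebuilt core with the live one.
        params = urllib.parse.urlencode(
            {"action": "SWAP", "core": live_core, "other": build_core}
        )
        return urllib.request.urlopen(
            f"http://{host}:8983/solr/admin/cores?{params}"
        ).read()

    # Hypothetical: once all the rebuilds finish, swap each shard into place.
    for n in range(1, 7):
        swap_cores(f"idx{n}.example.edu", f"static{n}", f"static{n}_build")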

I did run some early tests with a single large index. Performance was pretty decent once it was warmed up, but I was worried about how it would hold up under heavy load and how it would cope with frequent updates. I never got very far testing those concerns, because the full rebuild time alone was unacceptable - at least 8 hours. The source database can keep up with six DIH (DataImportHandler) instances reindexing at once, and that parallel rebuild finishes much more quickly than a single instance pulling the entire database would. I may increase the number of shards after I remove virtualization, but first I'll need to fix a few limitations in my build system.
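The parallel rebuild itself is nothing exotic: each shard's build core has its own DIH configuration pulling only its slice of the database, and the build script kicks them all off at once and polls until they go idle. A rough sketch (URLs, core names, and the sharding rule are hypothetical):

    import time
    import urllib.request

    # Hypothetical build cores, one per static shard.
    cores = [f"http://idx{n}.example.edu:8983/solr/static{n}_build"
             for n in range(1, 7)]

    # Start a full-import on every shard at once.  Each core's DIH config
    # selects only its slice of the source database, e.g. WHERE MOD(id, 6) = n.
    for core in cores:
        urllib.request.urlopen(core + "/dataimport?command=full-import&clean=true")

    # Poll until every import reports an idle status.
    busy = set(cores)
    while busy:
        time.sleep(30)
        for core in list(busy):
            status = urllib.request.urlopen(core + "/dataimport?command=status").read()
            if b"idle" in status:
                busy.discard(core)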

Thanks,
Shawn
