Hamid,

I have just finished working on a client project similar to this. We had
60 million documents (heading for 120m) when I stopped working there.

It is worth bearing in mind that, while Solr can certainly handle an
index of that size, at that kind of scale you are likely to have to
put in substantial effort to optimise your setup. A colleague spent
nearly three months getting reasonable performance out of our system.

We were running that 60 million document index across 10 shards, each
on its own host. The main constraint we had was memory. Firstly, we
made sure that our new searchers were autowarmed before they were
brought into use (upon commit). That way the first user didn't pay the
price of waiting for the caches to be populated.
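
As a rough illustration (not our actual config), autowarming is
controlled in solrconfig.xml by the autowarmCount on each cache, plus
an optional newSearcher listener that fires warming queries. The cache
sizes and the publish_date sort field below are placeholders, not
recommendations:

  <filterCache class="solr.FastLRUCache"
               size="4096" initialSize="1024" autowarmCount="512"/>
  <queryResultCache class="solr.LRUCache"
                    size="1024" initialSize="256" autowarmCount="128"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- example warming query; use whatever your users hit most -->
      <lst>
        <str name="q">*:*</str>
        <str name="sort">publish_date desc</str>
      </lst>
    </arr>
  </listener>

The warming queries run against the new searcher before it is
registered, so by the time real queries reach it the caches are
already populated.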

We also had to make sure that we were only warming one searcher at a
time. Warming two would invariably blow our memory (as that would mean
we'd have three searchers, and their caches, in memory at once).
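
If I remember rightly, the relevant solrconfig.xml setting is simply:

  <maxWarmingSearchers>1</maxWarmingSearchers>

With that in place, a commit that would start a second warming
searcher fails with an error rather than piling yet another searcher's
caches into the heap.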

Some knowledge of how JVMs work is important too. We watched garbage
collection (using jstat -gcutil) to see how frequently it was
happening. Tuning the various memory pools helped (I believe we
increased the size of our eden space, where short-lived objects are
kept), which reduced the impact of garbage collection. In the end,
though, we found the best configuration simply by setting each of our
10 servers up with different settings and leaving them to run for a
day. Looking at the jstat -gcutil output at the end of that showed us
how much garbage collection had been happening on each, and allowed us
to pick the best configuration.
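
For reference, the monitoring command and the kind of flags we were
tweaking looked something like the following. The pid is your Solr
JVM's, and the heap/eden sizes are illustrative values, not
recommendations:

  # report eden/survivor/old occupancy and GC counts every 5 seconds
  jstat -gcutil <solr-pid> 5s

  # e.g. in Tomcat's JAVA_OPTS (example sizes only)
  JAVA_OPTS="-Xms8g -Xmx8g -Xmn2g -XX:+UseConcMarkSweepGC \
             -verbose:gc -XX:+PrintGCDetails"

-Xmn sizes the young generation (eden plus the survivor spaces), which
is where the eden tweak mentioned above lives.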

For us, almost every query involved faceting across 13 fields - a
performance nightmare. If you don't need faceting, you'll likely get
more out of your hardware money than we did, but you're still likely
to need to pay attention to these kinds of issues when working at the
scale you are.
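
To give a sense of the cost, faceting is just extra parameters on each
request, one facet.field per field, and every faceted field has to
build and hold its own in-memory structures. The host, core and field
names here are made up:

  http://localhost:8983/solr/core0/select?q=*:*
    &facet=true&facet.field=category&facet.field=author
    &facet.field=year  ...and so on, 13 times over in our case

Multiply that per-field memory cost by 13 and you can see where our
headroom went.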

If your collection is growing in size, you will also need to work out
a strategy for resharding. What do you do when one of your shards gets
larger than the optimal size for a Lucene index? Adding an extra
(empty) shard isn't always the answer, as it can screw up some
relevance calculations. Our current approach to this is clumsy: we
keep an offline archive of all content, which allows us to fire up a
new row of shard servers (with a different number of shards) and
completely re-index everything.

Hope this helps.

Upayavira



On Wed, 24 Nov 2010 04:11 -0800, "Hamid Vahedi" <hvb...@yahoo.com>
wrote:
> Hi to all,
> 
> We are using Solr multi-core, with 6 cores per server in shard mode
> (2 servers so far, so 12 cores in total), running on Tomcat on
> Windows 2008 with 18GB of RAM assigned to it.
> 
> We add almost 6 million docs per day to Solr (up to 200 docs/sec),
> which must appear in query results in real time (currently more than
> 350 million docs are indexed).
> Queries are very slow (about 4-32 sec), but update performance is
> very good.
> 
> note1: results must be sorted by publish date, descending.
> note2: queries on a single shard are also sometimes slow (300ms-2s).
> note3: we can't optimize the index because docs are always being
> added.
> 
> Can Solr help me?
> If yes, what's the best configuration?
> If not, what is the best solution?
> 
> Kind Regards,
> Hamid
