Hamid, I have just finished working on a client project similar to this. We had 60 million documents (heading for 120m) when I stopped working there.
It is worth bearing in mind that, while Solr can certainly handle an index of that size, at that kind of scale you are likely to have to put in substantial effort to optimise your setup. A colleague spent nearly three months getting reasonable performance out of our system. We were running the 60 million document index mentioned above on 10 shards, each on its own host.

The main constraint we had was memory. Firstly, we made sure that our searchers were autowarmed before they were brought into use (upon commit), so that the first user didn't pay the price of waiting for the caches to be populated. We also had to make sure that we were only warming one searcher at a time: warming two would invariably blow our memory, as it would mean we had three searchers, and their caches, in memory at once.

Some knowledge of how JVMs work is important too. We watched garbage collection (using jstat -gcutil) to see how frequently it was happening. Tuning the various memory pools helped (I believe we increased the size of our eden space, where short-lived objects are kept), which reduced the impact of garbage collection, but in the end we found the optimal configuration simply by setting each of our 10 servers up with different settings and leaving them to run for a day. Looking at the jstat -gcutil output at the end of that showed us how much garbage collection had been happening on each, and allowed us to identify the best configuration.

For us, almost every query involved faceting across 13 fields - a performance nightmare. If you don't need faceting, you'll likely get more for your hardware money than we did, but you're still likely to need to pay attention to these kinds of issues when working at the scale you are.

If your collection is growing in size, you will also need to work out a strategy for resharding. What do you do when any of your shards gets larger than the optimal size for a Lucene index?
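To make the warming setup concrete, here is a minimal sketch of the relevant solrconfig.xml settings. The element and attribute names (autowarmCount, QuerySenderListener, maxWarmingSearchers) are standard Solr config; the sizes and the publish_date field are purely illustrative, not the values we actually used.

```xml
<!-- solrconfig.xml (illustrative values only) -->
<query>
  <!-- autowarmCount copies the hottest N entries from the old
       searcher's cache into the new one before it serves queries -->
  <filterCache class="solr.FastLRUCache"
               size="4096" initialSize="1024" autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache"
                    size="1024" initialSize="256" autowarmCount="64"/>

  <!-- fire some representative queries at each new searcher on commit,
       so the first real user doesn't pay the warm-up cost -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="sort">publish_date desc</str> <!-- hypothetical field -->
      </lst>
    </arr>
  </listener>

  <!-- never warm more than one searcher at a time, or you can end up
       with several searchers (and their caches) in memory at once -->
  <maxWarmingSearchers>1</maxWarmingSearchers>
</query>
```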
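The GC-watching side of this can be sketched as follows. jstat -gcutil and the -Xmn flag are standard HotSpot tooling; the pid lookup, heap sizes, and sampling interval are assumptions for illustration.

```shell
# Sample GC activity for the Solr JVM every 5 seconds, 12 samples
# (the pgrep pattern assumes Solr was started via java -jar start.jar)
SOLR_PID=$(pgrep -f start.jar | head -n1)
jstat -gcutil "$SOLR_PID" 5000 12
# Columns of interest: YGC/YGCT (young-generation collections and time),
# FGC/FGCT (full collections and time). Frequent full GCs are the warning sign.

# Enlarging the young generation (eden plus survivor spaces) so that
# short-lived per-query objects die there rather than being promoted:
JAVA_OPTS="$JAVA_OPTS -Xms8g -Xmx8g -Xmn2g -verbose:gc"
```

Running different settings on each server for a day and comparing the jstat output, as described above, is a crude but effective way to pick between candidate configurations.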
Adding an extra (empty) shard isn't always the answer, as it can screw up some relevance calculations. Our current approach to this is clumsy: we keep an offline archive of all content, allowing us to fire up a new row of shard servers (with a different number of shards) and completely re-index everything.

Hope this helps.

Upayavira

On Wed, 24 Nov 2010 04:11 -0800, "Hamid Vahedi" <hvb...@yahoo.com> wrote:
> Hi to all
>
> We are using Solr multi-core, with 6 cores per server in shard mode
> (2 servers so far, therefore 12 cores in total), running on Tomcat on
> Windows 2008 with 18GB RAM assigned to it.
>
> We add almost 6 million docs per day to Solr (up to 200 docs/sec),
> which must appear in query results in real time (currently more than
> 350 million docs indexed). Queries are very slow (about 4-32 sec),
> but update performance is very good.
>
> note1: results must be sorted by publish date desc.
> note2: a query on a single shard is also sometimes slow (300ms-2s)
> note3: we can't optimize the index because docs are always being added.
>
> Can Solr help me?
> If yes, what's the best configuration?
> If not, what is the best solution?
>
> Kind Regards,
> Hamid