I think your advice on future incremental updates is very useful. I will keep an eye on that.
Actually, I am currently interested in how to boost the merging/optimizing performance of a single Solr instance. Parallelism at the MapReduce level does not help merging/optimizing much, unless Solr/Lucene internally has a distributed indexing mechanism such as threading. Specifically, I am talking about these parameters in TreeMergeOutputFormat:

    // ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnceExplicit(10000);
    // ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnce(10000);
    // ((TieredMergePolicy) mergePolicy).setSegmentsPerTier(10000);

https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L119-121

Do you know how they affect merging/optimizing performance, or do you know of any docs about them? I tried uncommenting them, and the performance improved, so I am considering tuning the parameters further.

As you mentioned, IndexWriter.forceMerge does exist, at line 153:
https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L153

I am very grateful for your advice. Thanks a lot.

On Mon, Jun 15, 2015 at 10:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Ah, OK. For very slowly changing indexes, optimize can make sense.
>
> Do note, though, that if you incrementally index after the full build, and
> especially if you update documents, you're laying a trap for the future.
> Let's say you optimize down to a single segment. The default
> TieredMergePolicy tries to merge "similar size segments". But now you have
> one huge segment, and docs will be marked as deleted from that segment but
> not cleaned up until that segment is merged, which won't happen for a long
> time since it is so much bigger (I'm assuming) than the segments the
> incremental indexing will create.
>
> Now, the percentage of deleted documents weighs quite heavily in the
> decision of what segments to merge, so it might not matter. It's just
> something to be aware of.
> Surely benchmarking is in order, as you indicated.
>
> The Lucene-level IndexWriter.forceMerge method seems to be what you need,
> though, although if you're working over HDFS I'm in unfamiliar territory.
> But the constructors to IndexWriter take a Directory, and HdfsDirectory
> extends BaseDirectory, which extends Directory, so if you can set up
> an HdfsDirectory it should "just work". I haven't personally tried it
> though.
>
> I saw something recently where optimization helped considerably in a
> sharded situation where the rows parameter was 400 (10 shards). My
> belief is that what was really happening was that the first pass of a
> distributed search was getting slowed by disk seeks across multiple
> smaller segments. I'm waiting for SOLR-6810, which should impact that
> problem. Don't know if it applies to your situation or not, though.
>
> HTH,
> Erick
>
>
> On Mon, Jun 15, 2015 at 8:30 PM, Shenghua(Daniel) Wan
> <wansheng...@gmail.com> wrote:
> > Hi, Erick,
> > First, thanks for sharing the ideas. I am giving more context here
> > accordingly.
> >
> > 1. Why optimize? I have done some experiments to compare the query
> > response time, and there is some difference. In addition, the searcher
> > will be customer-facing. I think any performance boost will be
> > worthwhile unless the indexing becomes more frequent. However, more
> > benchmarking will be necessary to quantify the margin.
> >
> > 2. Why an embedded Solr server? I adopted the idea from Mark Miller's
> > map-reduce indexing and built on top of his original contribution to
> > Solr. It launches an embedded Solr server at the end of the reducer
> > stage. Basically, a Solr "instance" is brought up and fed with
> > documents. Then the index is generated at each reducer. Then the
> > indexes are merged, and optimized if desired.
> >
> > Thanks.
> >
> > On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson
> > <erickerick...@gmail.com> wrote:
> >
> >> The first question is why you're optimizing at all. It's not
> >> recommended unless you can demonstrate that an optimized index is
> >> giving you enough of a performance boost to be worth the effort.
> >>
> >> And why are you using an embedded Solr server? That's kind of unusual,
> >> so I wonder if you've gone down a wrong path somewhere. In other
> >> words, this feels like an XY problem: you're specifically asking about
> >> a task without explaining the problem you're trying to solve; there
> >> may be better alternatives.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
> >> <wansheng...@gmail.com> wrote:
> >> > Hi,
> >> > Do you have any suggestions to improve the performance of merging
> >> > and optimizing an index?
> >> > I have been using an embedded Solr server to merge and optimize the
> >> > index, and I am looking for the right parameters to tune. My use
> >> > case has about 300 fields plus 250 copyFields, and moderate doc size
> >> > (about 65K per doc on average).
> >> >
> >> > https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
> >> >
> >> > Thanks a lot for any ideas and suggestions.
> >> >
> >> > --
> >> >
> >> > Regards,
> >> > Shenghua (Daniel) Wan
> >
> >
> > --
> >
> > Regards,
> > Shenghua (Daniel) Wan
>

--

Regards,
Shenghua (Daniel) Wan
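P.S. For the archives: the three TieredMergePolicy knobs discussed at the top of this thread can also be set declaratively in solrconfig.xml instead of in code. A minimal sketch, assuming the classic <mergePolicy> element (Solr 4.x/5.x-era syntax); the values shown are Lucene's defaults, not recommendations. Raising them lets more segments be merged at once, at the cost of more open files and I/O pressure during the merge:

```xml
<indexConfig>
  <!-- TieredMergePolicy is already the default; declared here only to
       expose its tuning parameters -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- max segments merged at once during normal background merges -->
    <int name="maxMergeAtOnce">10</int>
    <!-- how many similar-sized segments are allowed per tier before a
         merge is triggered -->
    <int name="segmentsPerTier">10</int>
    <!-- max segments merged at once during an explicit optimize/forceMerge -->
    <int name="maxMergeAtOnceExplicit">30</int>
  </mergePolicy>
</indexConfig>
```

Note that segmentsPerTier controls how many segments accumulate before a merge fires, while the maxMergeAtOnce* settings cap the fan-in of a single merge, so they interact when tuning for a one-shot optimize.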