I think your advice on future incremental updates is very useful. I will
keep an eye on that.

Actually, I am currently interested in how to boost the merging/optimizing
performance of a single Solr instance.
Parallelism at the MapReduce level does not help merging/optimizing much,
unless Solr/Lucene internally has a distributed indexing mechanism, such as
threading.

Specifically, I am talking about the parameters in
//          ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnceExplicit(10000);
//          ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnce(10000);
//          ((TieredMergePolicy) mergePolicy).setSegmentsPerTier(10000);
https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L119-121
Do you know how they affect merging/optimizing performance, or do you
know of any documentation about them?
I tried uncommenting them, and the performance improved. I am
considering tuning the parameters further.
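
One way to picture why a larger merge fan-in could help: if only a limited
number of segments can be merged at once, collapsing many segments into one
takes several passes, and every document is rewritten once per pass. The
sketch below is a toy model in plain Java, not Lucene code. The class, the
method and the equal-sized segments are all hypothetical; only the name
maxMergeAtOnce is borrowed from the setter above. The real TieredMergePolicy
also weighs segment sizes and deleted documents, so treat this as a rough
intuition only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Lucene code): cost of merging many segments down to one. */
final class MergeFanInSketch {

    /** Outcome of collapsing all segments into a single segment. */
    public static final class Result {
        public final int passes;        // rounds of merging that were needed
        public final long docsWritten;  // documents rewritten across all merges

        Result(int passes, long docsWritten) {
            this.passes = passes;
            this.docsWritten = docsWritten;
        }
    }

    /**
     * Repeatedly merges groups of up to maxMergeAtOnce adjacent segments
     * until one segment is left. Assumes equal-sized starting segments.
     */
    public static Result mergeToOne(int segmentCount, long docsPerSegment, int maxMergeAtOnce) {
        if (segmentCount < 1 || maxMergeAtOnce < 2) {
            throw new IllegalArgumentException("need >= 1 segment and a fan-in >= 2");
        }
        List<Long> segments = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            segments.add(docsPerSegment);
        }
        int passes = 0;
        long docsWritten = 0;
        while (segments.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < segments.size(); i += maxMergeAtOnce) {
                int end = Math.min(i + maxMergeAtOnce, segments.size());
                long merged = 0;
                for (int j = i; j < end; j++) {
                    merged += segments.get(j);
                }
                // A lone leftover segment is carried over without being rewritten.
                if (end - i > 1) {
                    docsWritten += merged;
                }
                next.add(merged);
            }
            segments = next;
            passes++;
        }
        return new Result(passes, docsWritten);
    }

    public static void main(String[] args) {
        // 1000 segments of one million documents each, with three fan-in values.
        for (int fanIn : new int[] {10, 30, 10000}) {
            Result r = mergeToOne(1000, 1_000_000L, fanIn);
            System.out.printf("maxMergeAtOnce=%d passes=%d docsWritten=%d%n",
                    fanIn, r.passes, r.docsWritten);
        }
    }
}
```

In this toy model, 1000 equal segments need three passes at a fan-in of 10 and
a single pass at 10000, so each document is rewritten once instead of three
times. Whether that carries over to a real index would still have to be
benchmarked, since one very wide merge also needs more open files and memory.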

As you mentioned, IndexWriter.forceMerge does exist in line 153 of
https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L153

I am very grateful for your advice. Thanks a lot.

On Mon, Jun 15, 2015 at 10:39 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Ah, OK. For very slowly changing indexes optimize can make sense.
>
> Do note, though, that if you incrementally index after the full build, and
> especially if you update documents, you're laying a trap for the future.
> Let's say you optimize down to a single segment. The default
> TieredMergePolicy tries to merge "similar size segments". But now you have
> one huge segment and docs will be marked as deleted from that segment, but
> not cleaned up until that segment is merged, which won't happen for a long
> time since it is so much bigger (I'm assuming) than the segments the
> incremental indexing will create.
>
> Now, the percentage of deleted documents weighs quite heavily in the
> decision of which segments to merge, so it might not matter. It's just
> something to be aware of.
> Surely benchmarking is in order as you indicated.
>
> The Lucene-level IndexWriter.forceMerge method seems to be what you need
> though, although if you're working over HDFS I'm in unfamiliar territory.
> But the constructors to IndexWriter take a Directory, and the HdfsDirectory
> extends BaseDirectory which extends Directory so if you can set up
> an HdfsDirectory it should "just work". I haven't personally tried it
> though.
>
> I saw something recently where optimization helped considerably in a
> sharded situation where the rows parameter was 400 (10 shards). My
> belief is that what was really happening was that the first-pass of a
> distributed search was getting slowed by disk seeks across multiple
> smaller segments. I'm waiting for SOLR-6810 which should impact that
> problem. Don't know if it applies to your situation or not though.
>
> HTH,
> Erick
>
>
> On Mon, Jun 15, 2015 at 8:30 PM, Shenghua(Daniel) Wan
> <wansheng...@gmail.com> wrote:
> > Hi, Erick,
> > First, thanks for sharing your ideas. Here is some more context.
> >
> > 1. Why optimize? I have done some experiments to compare the query
> > response time, and there is some difference. In addition, the searcher
> > will be customer-facing. I think any performance boost will be worthwhile
> > unless the indexing becomes more frequent. However, more benchmarking
> > will be necessary to quantify the margin.
> >
> > 2. Why embedded Solr server? I adopted the idea from Mark Miller's
> > map-reduce indexing and built on top of its original contribution to
> > Solr. It launches an embedded Solr server at the end of the reducer
> > stages. Basically a Solr "instance" is brought up and fed with documents.
> > Then the index is generated at each reducer. Then the indexes are merged,
> > and optimized if desired.
> >
> > Thanks.
> >
> > On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> The first question is why you're optimizing at all. It's not recommended
> >> unless you can demonstrate that an optimized index is giving you enough
> >> of a performance boost to be worth the effort.
> >>
> >> And why are you using embedded solr server? That's kind of unusual
> >> so I wonder if you've gone down a wrong path somewhere. In other
> >> words this feels like an XY problem, you're specifically asking about
> >> a task without explaining the problem you're trying to solve, there may
> >> be better alternatives.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
> >> <wansheng...@gmail.com> wrote:
> >> > Hi,
> >> > Do you have any suggestions for improving the performance of merging
> >> > and optimizing an index?
> >> > I have been using an embedded Solr server to merge and optimize the
> >> > index. I am looking for the right parameters to tune. My use case has
> >> > about 300 fields plus 250 copyFields, and a moderate doc size (about
> >> > 65K per doc on average).
> >> >
> >> > https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
> >> >
> >> > Thanks a lot for any ideas and suggestions.
> >> >
> >> > --
> >> >
> >> > Regards,
> >> > Shenghua (Daniel) Wan
> >>
> >
> >
> >
> > --
> >
> > Regards,
> > Shenghua (Daniel) Wan
>



-- 

Regards,
Shenghua (Daniel) Wan
