New segments are created when:
1> the ramBufferSizeMB limit is exceeded, or
2> a commit happens.
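
For reference, ramBufferSizeMB lives in the <indexConfig> section of 
solrconfig.xml. A minimal sketch only (the 512 just mirrors the setting you 
posted below, it's not a recommendation):

  <indexConfig>
    <!-- flush a new segment once the in-memory indexing buffer hits this size -->
    <ramBufferSizeMB>512</ramBufferSizeMB>
  </indexConfig>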

The maximum segment size defaults to 5G, but TieredMergePolicy can be 
configured in solrconfig.xml to allow larger max sizes by setting 
maxMergedSegmentMB.
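
If you do raise it, the usual place is the merge policy factory inside 
<indexConfig>. A rough sketch, not a recommendation (10240 MB = 10G is just an 
example value; double-check the element type against the 
TieredMergePolicyFactory docs for your Solr version):

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- raise the 5G default max merged segment size; example value only -->
      <double name="maxMergedSegmentMB">10240</double>
    </mergePolicyFactory>
  </indexConfig>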

Depending on your indexing rate, requiring commits every 100K records may be 
too frequent; I have no idea what your indexing rate is. In general I prefer a 
time-based autocommit policy. Say, for some reason, you stop indexing after 50K 
records: they’ll never be searchable unless you have a time-based commit. 
Besides, it’s much easier to explain to users “it may take 60 seconds for your 
doc to be searchable” than “well, depending on the indexing rate, it may be 
between 10 seconds and 6 hours for your docs to be searchable”. Of course, if 
you’re indexing at a very fast rate, that may not matter.
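
A time-based policy is just the autoCommit/autoSoftCommit block under 
<updateHandler> in solrconfig.xml. A sketch with example values only (60000 ms 
= 60 seconds; pick whatever latency you're willing to promise your users):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit: flushes new segments and truncates the transaction log -->
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- soft commit: controls when newly indexed docs become searchable -->
      <maxTime>60000</maxTime>
    </autoSoftCommit>
  </updateHandler>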

There’s no such thing as “low disk read during segment merging”. If 5 segments 
need to be merged, they all must be read in their entirety and the new segment 
must be completely written out. At best you can try to cut down on the number 
of times segment merges happen, but from what you’re describing that may not be 
feasible.
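
The knobs that trade merge frequency against segment count are exactly the 
ones you list below: segmentsPerTier and maxMergeAtOnce. Larger values mean 
fewer, bigger merges, but more segments for every search to visit. Sketch 
only, using the 20/20 values you already have rather than recommending new 
ones (these go in the same mergePolicyFactory element shown above):

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- more segments allowed per tier => merges happen less often -->
      <int name="segmentsPerTier">20</int>
      <!-- how many segments get combined in a single merge -->
      <int name="maxMergeAtOnce">20</int>
    </mergePolicyFactory>
  </indexConfig>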

Attachments are aggressively stripped by the mail server, your graph did not 
come through.

Once a segment grows to the max size (5G by default), it is not merged again 
unless and until it accumulates quite a number of deleted documents. So one 
question is whether you update existing documents frequently. Is that the case? 
If not, then the index size really shouldn’t matter and your problem is 
something else.

And I sincerely hope that part of your indexing does _NOT_ include 
optimize/forcemerge or expungeDeletes. Those are very expensive operations and, 
prior to Solr 7.5, would leave your index in an awkward state; see: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
There’s a link in that article for how this is different in Solr 7.5+.

But something smells fishy about this situation. Segment merging is typically 
not very noticeable. Perhaps you just have too much data on too little 
hardware? You’ve got some evidence that segment merging is the root cause, but 
I wonder if what’s happening is that you’re just swapping instead. Segment 
merging will certainly increase the I/O pressure, but by and large that 
shouldn’t really affect search speed if the OS memory space is large enough to 
hold the important portions of your index. If it isn’t large enough, the 
additional I/O pressure from merging may be enough to start your system 
swapping, which is A Bad Thing.

See: https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html 
for how Lucene uses MMapDirectory...

Best,
Erick

> On Jun 6, 2020, at 11:29 AM, Anshuman Singh <singhanshuma...@gmail.com> wrote:
> 
> Hi Eric, 
> 
> We are looking into TLOG/PULL replicas. But I have some doubts regarding 
> segments. Can you explain what causes creation of a new segment and how large 
> it can grow?
> And this is my index config:
> maxMergeAtOnce - 20
> segmentsPerTier - 20
> ramBufferSizeMB - 512 MB
> 
> Can I configure these settings optimally for low disk read during segment 
> merging? Like increasing segmentsPerTier may help but a large number of 
> segments may impact search. And as per the documentation, ramBufferSizeMB can 
> trigger segment merging so maybe that can be tweaked.
> 
> One more question:
> This graph is representing index time wrt core size (0-100G). Commits were 
> happening automatically at every 100k records.
> 
> 
> 
> As you can see the density of spikes is increasing as the core size is 
> increasing. When our core size becomes ~100 G, indexing becomes really slow. 
> Why is this happening? Do we need to put a limit on how large each core can 
> grow?
> 
> 
> On Fri, Jun 5, 2020 at 5:59 PM Erick Erickson <erickerick...@gmail.com> wrote:
> Have you considered TLOG/PULL replicas rather than NRT replicas? 
> That way, all the indexing happens on a single machine and you can
> use shards.preference to confine the searches to the PULL replicas,
> see:  https://lucene.apache.org/solr/guide/7_7/distributed-requests.html
> 
> No, you can’t really limit the number of segments. While that seems like a
> good idea, it quickly becomes counter-productive. Say you require that you
> have 10 segments. Say each one becomes 10G. What happens when the 11th
> segment is created and it’s 100M? Do you rewrite one of the 10G segments just
> to add 100M? Your problem gets worse, not better.
> 
> 
> Best,
> Erick
> 
> > On Jun 5, 2020, at 1:41 AM, Anshuman Singh <singhanshuma...@gmail.com> 
> > wrote:
> > 
> > Hi Nicolas,
> > 
> > Commit happens automatically at 100k documents. We don't commit explicitly.
> > We didn't limit the number of segments. There are 35+ segments in each core.
> > But unrelated to the question, I would like to know if we can limit the
> > number of segments in the core. I tried it in the past but the merge
> > policies don't allow that.
> > The TieredMergePolicy has two parameters, maxMergeAtOnce and
> > segmentsPerTier. It seems like we cannot control the total number of
> > segments but only the segments per tier.(
> > http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> > )
> > 
> > 
> > On Thu, Jun 4, 2020 at 5:48 PM Nicolas Franck <nicolas.fra...@ugent.be>
> > wrote:
> > 
> >> The real questions are:
> >> 
> >> * how often do you commit (either explicitly or automatically)?
> >> * how many segments do you allow? If you only allow 1 segment,
> >>  then that whole segment is recreated using the old documents and the
> >> updates.
> >>  And yes, that requires reading the old segment.
> >>  It is common to allow multiple segments when you update often,
> >>  so updating does not interfere with reading the index too often.
> >> 
> >> 
> >>> On 4 Jun 2020, at 14:08, Anshuman Singh <singhanshuma...@gmail.com>
> >> wrote:
> >>> 
> >>> I noticed that while indexing, when commit happens, there is high disk
> >> read
> >>> by Solr. The problem is that it is impacting search performance when the
> >>> index is loaded from the disk with respect to the query, as the disk read
> >>> speed is not quite good and the whole index is not cached in RAM.
> >>> 
> >>> When no searching is performed, I noticed that disk is usually read
> >> during
> >>> commit operations and sometimes even without commit at low rate. I guess
> >> it
> >>> is read due to segment merge operations. Can it be something else?
> >>> If it is merging, can we limit disk IO during merging?
> >> 
> >> 
> 
