Hi

My team has tasked me with upgrading Solr from the version we are using
(5.4) to the latest stable version, 6.6. I have been stuck on the indexing
part for a few days now.

First I'll list the requirements, then all the configuration settings I have
tried.

In total I'm indexing about 2.5 million documents. The average document
size is ~5KB. I have 10 (PHP) workers running in parallel, hitting Solr
with ~1K docs/minute (this sometimes goes up to ~3K docs/minute).
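
As a rough sanity check on these numbers, here is a minimal sketch of the
arithmetic. It assumes the quoted docs/minute figures are totals across all
10 workers and that 1 GB = 1,000,000 KB; the class and method names are made
up for illustration and are not part of Solr.

```java
// Back-of-the-envelope estimate only; inputs are the approximate figures quoted above.
class IndexingEstimate {
    // Raw payload in GB (decimal), ignoring index overhead and compression.
    static double rawSizeGb(long docs, double avgDocKb) {
        return docs * avgDocKb / 1_000_000.0;
    }

    // Hours needed to send `docs` documents at a steady total rate.
    static double hoursAtRate(long docs, double docsPerMinute) {
        return docs / docsPerMinute / 60.0;
    }

    public static void main(String[] args) {
        long docs = 2_500_000L; // "about 2.5 million documents"
        System.out.printf("raw size: %.1f GB%n", rawSizeGb(docs, 5.0));
        System.out.printf("at 1K docs/min: %.1f h%n", hoursAtRate(docs, 1_000));
        System.out.printf("at 3K docs/min: %.1f h%n", hoursAtRate(docs, 3_000));
    }
}
```

Under those assumptions the full set is on the order of 12.5 GB of raw
input, and a complete run would take somewhere between roughly 14 and 42
hours.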

System specifications:
RAM: 120G
Processors: 16

Solr configuration:
Heap size: 80G

------------------------------------------------------------------------------------------------------------
solrconfig.xml: (Relevant parts; please let me know if there's anything else
you would like to look at)

<autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>3800000</maxTime>
      <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

<ramBufferSizeMB>5000</ramBufferSizeMB>
<maxBufferedDocs>10000</maxBufferedDocs>

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">30</int>
      <int name="segmentsPerTier">30</int>
</mergePolicyFactory>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">8</int>
      <int name="maxThreadCount">7</int>
</mergeScheduler>

------------------------------------------------------------------------------------------------------------

The main problem:

When I start indexing everything is good until I reach about 2 million docs,
which takes ~10 hours. But then the commitscheduler thread gets blocked. It
is stuck at doStall() in ConcurrentMergeScheduler (CMS). Looking at the logs
from InfoStream, I found a "too many merges; stalling" message from the
commitscheduler thread, after which it gets stuck in the while loop forever.

Here's the check that's stalling our commitscheduler thread.

while (writer.hasPendingMerges() && mergeThreadCount() >= maxMergeCount) {
..
..
      if (verbose() && startStallTime == 0) {
        message("    too many merges; stalling...");
      }
      startStallTime = System.currentTimeMillis();
      doStall();
    }
}
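
To make the condition concrete, here is a toy model of just the loop test
quoted above. It is not Lucene's code: the class and method names are
invented for illustration, and the real scheduler also waits and re-checks
inside the loop. The value 8 is the maxMergeCount from the solrconfig.xml
above.

```java
// Toy model of the stall condition only; not the Lucene implementation.
class StallModel {
    // An incoming thread stalls while merges are pending and the number of
    // running merge threads has reached the configured cap.
    static boolean wouldStall(boolean hasPendingMerges, int mergeThreadCount, int maxMergeCount) {
        return hasPendingMerges && mergeThreadCount >= maxMergeCount;
    }

    public static void main(String[] args) {
        int maxMergeCount = 8; // from the solrconfig.xml above
        for (int running = 6; running <= 8; running++) {
            System.out.println(running + " running merges -> stall: "
                    + wouldStall(true, running, maxMergeCount));
        }
    }
}
```

With maxMergeCount=8, the eight merge threads (#0 to #7) listed in the
infoStream output further down would be exactly enough to make this
condition true for any thread that arrives while merges are still pending,
assuming all eight count toward mergeThreadCount().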

This is the reason I have put maxMergeCount and maxThreadCount explicitly in
my solrconfig. I thought increasing the number of threads would make sure
that there is always one extra thread for commit to go through. But now that
I have increased the allowed number of threads, Lucene just spawns that many
"Lucene Merge Thread"s and leaves none for when a commit comes along and
triggers a merge. And then it gets stuck forever.

Well, not really forever, I'm guessing that once one of the merging threads
is removed (by using removeMergeThread() in CMS) the commit will go through,
but for some reason, the merging is so slow that this doesn't happen (I gave
this a couple of hours of time, but commit thread was still stuck). Which
brings us to the second problem.

------------------------------------------------------------------------------------------------------------

The second problem:
Merging is extremely slow. I'm not sure what I'm missing here. Maybe there's
a change in the 6.x versions which has significantly hampered merging speed.
From the thread dump, what I can see is that the "Lucene Merge Thread"s are
in the Runnable state, inside a TreeMap.getEntry() call. Is this normal?

Another thing I noticed was that the disk IO is throttled at ~20 MB/s. But
I'm not sure if this is something that can actually hamper merging.

My index size was ~10GB and I left it overnight (~6 hours) and almost no
merging happened.

Here's another infoStream message from logs. Just putting it here in case it
helps.

-----

2017-09-06 14:11:07.921 INFO  (qtp834133664-115) [   x:collection1]
o.a.s.u.LoggingInfoStream [MS][qtp834133664-115]: updateMergeThreads
ioThrottle=true targetMBPerSec=23.6 MB/sec
merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB)
runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #7 estSize=1414.3 MB (written=0.0 MB)
runTime=0.0s (stopped=0.0s, paused=0.0s) rate=23.6 MB/sec
  leave running at 23.6 MB/sec
merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB)
runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #3 estSize=752.8 MB (written=362.8 MB)
runTime=8100.1s (stopped=0.0s, paused=12.4s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #2 estSize=312.5 MB (written=151.9 MB)
runTime=8100.7s (stopped=0.0s, paused=8.7s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #6 estSize=87.7 MB (written=63.0 MB)
runTime=3627.8s (stopped=0.0s, paused=0.9s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #1 estSize=57.3 MB (written=21.7 MB)
runTime=8101.2s (stopped=0.0s, paused=0.2s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #0 estSize=4.6 MB (written=0.0 MB)
runTime=8101.0s (stopped=0.0s, paused=0.0s) rate=unlimited
  leave running at Infinity MB/sec

-----
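
One thing that can be read off this output is how fast each merge has
actually been writing, by dividing written MB by runTime. The sketch below
does that with a regular expression. It is only an assumption that written
and runTime can be combined this way, and the class name and pattern are
made up for illustration rather than taken from Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: average write rate per merge thread, computed from infoStream text.
class MergeRateCheck {
    // Matches one "merge thread" entry, allowing the line wrap before runTime.
    private static final Pattern MERGE_ENTRY = Pattern.compile(
        "Lucene Merge Thread #(\\d+) estSize=[\\d.]+ MB \\(written=([\\d.]+) MB\\)\\s+runTime=([\\d.]+)s");

    // Maps merge thread number -> written MB divided by runTime seconds.
    static Map<Integer, Double> averageRates(String infoStream) {
        Map<Integer, Double> rates = new LinkedHashMap<>();
        Matcher m = MERGE_ENTRY.matcher(infoStream);
        while (m.find()) {
            double writtenMb = Double.parseDouble(m.group(2));
            double runTimeSec = Double.parseDouble(m.group(3));
            if (runTimeSec > 0) { // skip merges that have only just started
                rates.put(Integer.parseInt(m.group(1)), writtenMb / runTimeSec);
            }
        }
        return rates;
    }

    public static void main(String[] args) {
        // Two entries copied from the output above.
        String log =
            "merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB)\n"
          + "runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec\n"
          + "merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB)\n"
          + "runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec\n";
        averageRates(log).forEach((id, rate) ->
            System.out.printf("thread #%d: %.3f MB/sec%n", id, rate));
    }
}
```

For Merge Thread #4 this gives roughly 0.5 MB/sec averaged over its
lifetime, far below the 19.7 MB/sec limit shown on the same line, which
hints (but does not prove) that the IO throttle is not what is holding the
merges back.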

I also increased maxMergeAtOnce and segmentsPerTier from 10 to 20 and
then to 30, in the hope of having fewer merge threads running at once,
but that just results in more segments being created (not sure why this
would happen). I also tried going the other way by reducing them to 5, but
that experiment failed quickly (commit thread blocked).

I increased the ramBufferSizeMB to 5000MB so that there are fewer flushes
happening, so that fewer segments are created, so that fewer merges happen
(I haven't dug deep here, so please correct me if this is something I should
revert. Our current (5.x) config has this set at 324MB).
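
A rough way to check whether the larger RAM buffer can matter at all: as far
as I understand, Lucene flushes on whichever of ramBufferSizeMB and
maxBufferedDocs is reached first. The sketch below uses the ~5KB average
document size as a crude stand-in for per-document buffer usage, which is an
assumption (the real in-memory footprint will differ), and the class and
method names are made up for illustration.

```java
// Sketch: which flush limit is likely to be reached first, under a crude size proxy.
class FlushTrigger {
    // Docs needed to fill the RAM buffer if each doc used avgDocKb of it.
    static long docsToFillRamBuffer(double ramBufferMb, double avgDocKb) {
        return (long) (ramBufferMb * 1000.0 / avgDocKb);
    }

    // Name of the setting that would trigger a flush first.
    static String firstTrigger(double ramBufferMb, double avgDocKb, long maxBufferedDocs) {
        return docsToFillRamBuffer(ramBufferMb, avgDocKb) < maxBufferedDocs
                ? "ramBufferSizeMB"
                : "maxBufferedDocs";
    }

    public static void main(String[] args) {
        long maxBufferedDocs = 10_000; // from the solrconfig.xml above
        for (double ramMb : new double[] {324, 5000}) {
            System.out.println(ramMb + " MB buffer fills after ~"
                    + docsToFillRamBuffer(ramMb, 5.0) + " docs; first trigger: "
                    + firstTrigger(ramMb, 5.0, maxBufferedDocs));
        }
    }
}
```

Under that assumption the buffer would need about 1,000,000 docs to fill at
5000MB (about 64,800 at 324MB), so the 10000-doc maxBufferedDocs limit would
be hit first with either setting.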

The autoCommit and autoSoftCommit settings look good to me, as I've turned
off softCommits, and I am autoCommitting at 10000 docs (every 5-10 minutes),
which finishes smoothly, unless it gets stuck in the first problem described
above.


Questions:
1a. Why is Lucene spawning so many merging threads?
1b. How can I make sure that there's always room for the Commit thread to go
through?
1c. Is it normal for all the MergeThreads to be in the Runnable state at
TreeMap.getEntry()?

2a. Is merging slower in 6.x than 5.x?
2b. What can I do to make it go faster?
2c. Could disk IO throttling be an issue? If so, how can I resolve it? I
tried providing ioThrottle=false in solrconfig but that just throws an
error.

I have been trying different things for a week now. Please let me know if
there's anything else I can read or try beyond the things I have pointed
out above.

Regards
Yasoob Haider


