Hi,

My team has tasked me with upgrading Solr from the version we are using (5.4) to the latest stable version, 6.6. I have been stuck on the indexing part for a few days now.
First I'll list the requirements, then all the configuration settings I have tried.

In total I'm indexing about 2.5 million documents. The average document size is ~5KB. I have 10 (PHP) workers running in parallel, hitting Solr with ~1K docs/minute (this sometimes goes up to ~3K docs/minute).

System specifications:
RAM: 120G
Processors: 16

Solr configuration:
Heap size: 80G

------------------------------------------------------------------------------------------------------------
solrconfig.xml: (Relevant parts; please let me know if there's anything else you would like to look at)

    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>3800000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

    <ramBufferSizeMB>5000</ramBufferSizeMB>
    <maxBufferedDocs>10000</maxBufferedDocs>

    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">30</int>
      <int name="segmentsPerTier">30</int>
    </mergePolicyFactory>

    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">8</int>
      <int name="maxThreadCount">7</int>
    </mergeScheduler>

------------------------------------------------------------------------------------------------------------
The main problem:

When I start indexing, everything is fine until I reach about 2 million docs, which takes ~10 hours. But then the commitScheduler thread gets blocked. It is stuck at doStall() in ConcurrentMergeScheduler (CMS). Looking at the InfoStream logs, I found a "too many merges; stalling" message from the commitScheduler thread, after which it stays in the while loop forever. Here's the check that is stalling our commitScheduler thread:

    while (writer.hasPendingMerges() && mergeThreadCount() >= maxMergeCount) {
      ..
      ..
      if (verbose() && startStallTime == 0) {
        message(" too many merges; stalling...");
      }
      startStallTime = System.currentTimeMillis();
      doStall();
    }

This is the reason I have set maxMergeCount and maxThreadCount explicitly in my solrconfig. I thought increasing the number of threads would make sure that there is always one extra thread for the commit to go through. But now that I have increased the allowed number of threads, Lucene just spawns that many "Lucene Merge Thread"s and leaves none for when a commit comes along and triggers a merge. And then it gets stuck forever.

Well, not really forever: I'm guessing that once one of the merging threads is removed (via removeMergeThread() in CMS) the commit will go through, but for some reason the merging is so slow that this doesn't happen (I gave it a couple of hours, but the commit thread was still stuck). Which brings us to the second problem.

------------------------------------------------------------------------------------------------------------
The second problem:

Merging is extremely slow. I'm not sure what I'm missing here. Maybe there's a change in 6.x that has significantly hampered merging speed. From the thread dump, I can see that the "Lucene Merge Thread"s are in the Runnable state, inside a TreeMap.getEntry() call. Is this normal?

Another thing I noticed is that the disk IO is throttled at ~20MB/s, but I'm not sure whether this is something that can actually hamper merging. My index size was ~10GB; I left it overnight (~6 hours) and almost no merging happened.

Here's another infoStream message from the logs. Just putting it here in case it helps.
-----
2017-09-06 14:11:07.921 INFO (qtp834133664-115) [ x:collection1] o.a.s.u.LoggingInfoStream [MS][qtp834133664-115]: updateMergeThreads ioThrottle=true targetMBPerSec=23.6 MB/sec
merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB) runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #7 estSize=1414.3 MB (written=0.0 MB) runTime=0.0s (stopped=0.0s, paused=0.0s) rate=23.6 MB/sec
  leave running at 23.6 MB/sec
merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB) runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #3 estSize=752.8 MB (written=362.8 MB) runTime=8100.1s (stopped=0.0s, paused=12.4s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #2 estSize=312.5 MB (written=151.9 MB) runTime=8100.7s (stopped=0.0s, paused=8.7s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #6 estSize=87.7 MB (written=63.0 MB) runTime=3627.8s (stopped=0.0s, paused=0.9s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #1 estSize=57.3 MB (written=21.7 MB) runTime=8101.2s (stopped=0.0s, paused=0.2s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #0 estSize=4.6 MB (written=0.0 MB) runTime=8101.0s (stopped=0.0s, paused=0.0s) rate=unlimited
  leave running at Infinity MB/sec
-----

I also increased maxMergeAtOnce and segmentsPerTier from 10 to 20 and then to 30, hoping to have fewer merge threads running at once, but that just results in more segments being created (I'm not sure why this happens). I also tried going the other way by reducing them to 5, but that experiment failed quickly (commit thread blocked).
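As a rough sanity check on the infoStream numbers above, the sketch below computes the effective write rate of a few merge threads as written / runTime and compares it with the throttle rate. It assumes that "written" and "runTime" mean what their names suggest (MB written so far and seconds the merge has been running). The summarize helper and the regular expression are my own illustration, not part of Lucene or Solr.

```python
import re

# Three entries copied from the infoStream message above.
LOG = """\
merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB) runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec
merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB) runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec
merge thread Lucene Merge Thread #3 estSize=752.8 MB (written=362.8 MB) runTime=8100.1s (stopped=0.0s, paused=12.4s) rate=19.7 MB/sec
"""

# Hypothetical pattern written for this log excerpt only.
PATTERN = re.compile(
    r"Lucene Merge Thread #(?P<id>\d+) estSize=(?P<est>[\d.]+) MB "
    r"\(written=(?P<written>[\d.]+) MB\) runTime=(?P<run>[\d.]+)s "
    r"\(stopped=[\d.]+s, paused=(?P<paused>[\d.]+)s\) rate=(?P<rate>[\d.]+) MB/sec"
)


def summarize(log):
    """Return one dict per merge thread with its effective rate and paused share."""
    rows = []
    for m in PATTERN.finditer(log):
        written = float(m.group("written"))
        run = float(m.group("run"))
        paused = float(m.group("paused"))
        rows.append({
            "thread": int(m.group("id")),
            # Assumption: MB written so far divided by seconds running.
            "effective_mb_per_sec": written / run,
            # Share of the run time the thread was paused.
            "paused_fraction": paused / run,
            "throttle_mb_per_sec": float(m.group("rate")),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(LOG):
        print(
            "thread #%d: effective %.3f MB/sec, throttle %.1f MB/sec, paused %.1f%% of run time"
            % (
                row["thread"],
                row["effective_mb_per_sec"],
                row["throttle_mb_per_sec"],
                100 * row["paused_fraction"],
            )
        )
```

If that reading of the fields is right, thread #4 has averaged roughly 0.5 MB/sec and spent under 2% of its run time paused, so the ~20 MB/sec throttle may not be the main limit on these merges.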
I increased ramBufferSizeMB to 5000MB so that fewer flushes happen, so that fewer segments are created, so that fewer merges happen. (I haven't dug deep here, so please correct me if this is something I should revert. Our current 5.x config has this set to 324MB.)

The autoCommit and autoSoftCommit settings look good to me: I've turned off soft commits, and I am auto-committing every 10000 docs (every 5-10 minutes), which finishes smoothly unless it runs into the first problem described above.

Questions:

1a. Why is Lucene spawning so many merging threads?
1b. How can I make sure that there's always room for the commit thread to go through?
1c. Is it normal for all the merge threads to be in the Runnable state at TreeMap.getEntry()?

2a. Is merging slower in 6.x than in 5.x?
2b. What can I do to make it go faster?
2c. Could disk IO throttling be an issue? If so, how can I resolve it? I tried providing ioThrottle=false in solrconfig, but that just throws an error.

I have been trying different things for a week now. Please let me know if there's anything else I can read or try beyond the things I have listed above.

Regards,
Yasoob Haider

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html