Hi Denis,

Merging works on segments and, depending on the merge strategy, it is triggered separately, so there is no queue between the update executor and the merge threads.
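In case it helps, the merge scheduler can be tuned in solrconfig.xml. Below is only a minimal sketch, not a recommendation - the values are illustrative, and the ioThrottle flag needs Solr 7.4+ (see Mikhail's note further down in the thread):

  <indexConfig>
    <!-- ConcurrentMergeScheduler runs merges on background threads; if merges
         pile up beyond maxMergeCount, it stalls the incoming indexing threads
         until merging catches up. -->
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">6</int>   <!-- illustrative value -->
      <int name="maxThreadCount">4</int>  <!-- illustrative value -->
      <!-- Solr 7.4+ only (SOLR-11200): disable merge I/O throttling -->
      <bool name="ioThrottle">false</bool>
    </mergeScheduler>
  </indexConfig>

That stall is usually where "blocked" indexing shows up, so checking whether the disks can keep up with merging, and possibly raising maxMergeCount, is the first knob I would look at.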
Re SPM - I am using it on a daily basis for most of my consulting work, and if you have an SPM app you can invite me to it and I'll take a quick look to see if there are some obvious bottlenecks.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


> On 23 Apr 2018, at 23:37, Denis Demichev <demic...@gmail.com> wrote:
>
> I conducted another experiment today with local SSD drives, but this did not seem to fix my problem.
> I don't see any extensive I/O in this case:
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvda              1.76        88.83         5.52    1256191      77996
> xvdb             13.95       111.30     56663.93    1573961  801303364
>
> xvdb is the device where SolrCloud is installed and the data files are kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." threads running. Some of them are BLOCKED, some of them are RUNNING.
> - updateExecutor-N-thread-M threads are in parked mode, and the number of docs that I am able to submit is still low.
> - Tried to change maxIndexingThreads and set it to something high. This seems to prolong the time during which the cluster accepts new indexing requests and keeps CPU utilization a lot higher while the cluster is merging indexes.
>
> Could anyone please point me in the right direction (documentation or Java classes) where I can read about how data is passed from the updateExecutor thread pool to the merge threads? I assume there should be some internal blocking queue or something similar.
> I still cannot wrap my head around how Solr blocks incoming connections. Non-merged indexes are not kept in memory, so I don't clearly understand why Solr cannot keep writing index files to disk while other threads are merging indexes (since this is a continuous process anyway).
>
> Does anyone use the SPM monitoring tool for this type of problem? Is it of any use at all?
>
> Thank you in advance.
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:
> Mikhail,
>
> Sure, I will keep everyone posted. Moving to a non-HVM instance may take some time, so hopefully I will be able to share my observations in the next couple of days or so.
> Thanks again for all the help.
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
> Denis, please let me know what it ends up with. I'm really curious regarding this case and AWS instance flavours. FWIW, since 7.4 we'll have an ioThrottle=false option.
>
> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com> wrote:
> Mikhail, Erick,
>
> Thank you.
>
> What just occurred to me - we don't use local SSDs; instead we're using EBS volumes.
> This was the wrong instance type that I looked at.
> Will try to set up a cluster with SSD nodes and retest.
>
> Regards,
> Denis
>
>
> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org> wrote:
> I'm not sure it's the right context, but here is one guy showing a really low throttle boundary:
> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>
> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org> wrote:
> Threads are hanging on merge I/O throttling:
>         at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>         at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>         at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>         at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
> It seems odd. Please confirm that you don't commit on every update request.
> The only way to monitor I/O throttling is to enable infoStream and read a lot of logs.
>
>
> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:
> Erick,
>
> Thank you for your quick response.
>
> I/O bottleneck: Please see another screenshot attached; as you can see, disk r/w operations are pretty low or not significant.
>
> iostat ==========
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.52    0.00    0.00    0.00    0.00   87.48
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.51    0.00    0.00    0.00    0.00   87.49
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> ==========================
>
> Merging threads: I don't see any modifications of the merge policy compared to the default solrconfig.
> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB><maxBufferedDocs>500000</maxBufferedDocs>
> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
> Could you please help me understand how I can validate this theory?
> Another note here: even if I remove the stress from the cluster, I still see that the merging threads are consuming CPU for some time. It may take hours, and if I try to return the stress back nothing changes.
> If this is an overloaded merging process, it should take some time to reduce the queue length and then start accepting new indexing requests again.
> Maybe I am wrong, but I need some help to understand how to check it.
>
> AWS - Sorry, I don't have any physical hardware to replicate this test locally.
>
> GC - I monitored GC closely. If you take a look at the CPU utilization screenshot you will see a blue graph that is GC consumption. In addition to that I am using the Visual GC plugin from VisualVM to understand how GC performs under the stress, and I don't see any anomalies.
> There are several GC pauses from time to time but those are not significant.
> The heap utilization graph tells me that GC is not struggling a lot.
>
> Thank you again for your comments; I hope the information above will help you understand the problem.
>
>
> Regards,
> Denis
>
>
> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com> wrote:
> Have you changed any of the merge policy parameters? I doubt it, but just asking.
>
> My guess: your I/O is your bottleneck. There are a limited number of threads (tunable) that are used for background merging. When they're all busy, incoming updates are queued up. This squares with your statement that queries are fine and CPU activity is moderate.
>
> A quick test there would be to try this on a non-AWS setup if you have some hardware you can repurpose.
>
> An 80G heap is a red flag. Most of the time that's too large by far. So one thing I'd do is hook up some GC monitoring; you may be spending a horrible amount of time in GC cycles.
>
> Best,
> Erick
>
> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com> wrote:
> >
> > All,
> >
> > I would like to request some assistance with the situation described below. My SolrCloud cluster accepts update requests at a very low pace, making it impossible to index new documents.
> >
> > Cluster Setup:
> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical memory, 80GB Java heap space, AWS
> > Java - openjdk version "1.8.0_161", OpenJDK Runtime Environment (build 1.8.0_161-b14), OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
> >
> > Symptoms:
> > 1. 4 instances running 4 threads each use the SolrJ client to submit documents to SolrCloud for indexing and do not perform any manual commits. Each document batch is 10 documents big, containing ~200 text fields per document.
> > 2. After some time (~20-30 minutes; by that time I see only ~50-60K documents in the collection, and node restarts do not help) I notice that clients cannot submit new documents to the cluster for indexing anymore; each operation takes an enormous amount of time.
> > 3. The cluster is not loaded at all, CPU consumption is moderate (I am seeing that merging is performed all the time, though), memory consumption is adequate, but still updates are not accepted from external clients.
> > 4. Search requests are handled fine.
> > 5. I don't see any significant activity in the SolrCloud logs anywhere, just regular replication attempts. No errors.
> >
> > Additional information:
> > 1. Please see the thread dump attached.
> > 2. Please see the SolrAdmin info with physical memory and file descriptor utilization.
> > 3. Please see the VisualVM screenshots with CPU and memory utilization and CPU profiling data. Physical memory utilization is about 60-70 percent all the time.
> > 4. The schema file contains ~10 permanent fields, 5 of which are mapped, mandatory, and persisted; the rest of the fields are optional and dynamic.
> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher set to false.
> > 6. Caches are set up with autoWarmCount = 0.
> > 7. GC was fine-tuned and I don't see any significant CPU utilization by GC or any lengthy pauses. The majority of the garbage is collected in the young gen space.
> >
> > My primary question: I see that the cluster is alive and performs some merging and commits but does not accept new documents for indexing.
> > What is causing this slowdown, and why does it not accept new submissions?
> >
> >
> > Regards,
> > Denis
>
> --
> Sincerely yours
> Mikhail Khludnev