Hi Denis,
Merging works on segments and, depending on the merge policy, is triggered 
separately, so there is no queue between the update executor and the merge threads.
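
Roughly, the path inside Solr/Lucene looks like this (the class names are real
7.x classes; the flow below is a paraphrased sketch, not actual source):

    // DirectUpdateHandler2.addDoc() ends up calling Lucene:
    indexWriter.updateDocument(term, doc);  // the doc is buffered in RAM
    // When ramBufferSizeMB/maxBufferedDocs is exceeded, a segment is flushed and
    // IndexWriter.maybeMerge() -> MergePolicy.findMerges() -> MergeScheduler.merge()
    // decides what to merge. ConcurrentMergeScheduler runs each merge on its own
    // "Lucene Merge Thread". There is no document queue in between: when pending
    // merges exceed maxMergeCount, the scheduler stalls the incoming indexing
    // threads until merging catches up, which is typically what parked update
    // threads look like.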

Re SPM - I am using it on a daily basis for most of my consulting work. If 
you have an SPM app you can invite me to it and I'll take a quick look to see 
if there are any obvious bottlenecks.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Apr 2018, at 23:37, Denis Demichev <demic...@gmail.com> wrote:
> 
> I conducted another experiment today with local SSD drives, but this did not 
> seem to fix my problem.
> I don't see any extensive I/O in this case:
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvda              1.76        88.83         5.52    1256191      77996
> xvdb             13.95       111.30     56663.93    1573961  801303364
> 
> xvdb is the device where SolrCloud is installed and the data files are kept.
> 
> What I see:
> - There are 17 "Lucene Merge Thread #..." threads; some of them are BLOCKED, 
> some are RUNNING.
> - The updateExecutor-N-thread-M threads are parked, and the number of docs I 
> am able to submit is still low.
> - I tried setting maxIndexingThreads to something high. This seems to prolong 
> the period during which the cluster accepts new indexing requests, and it 
> keeps CPU utilization a lot higher while the cluster is merging segments.
> 
> Could anyone please point me in the right direction (documentation or Java 
> classes) where I can read about how data is passed from the updateExecutor 
> thread pool to the merge threads? I assume there should be some internal 
> blocking queue or something similar.
> I still cannot wrap my head around how Solr blocks incoming connections. 
> Unmerged segments are not kept in memory, so I don't understand why Solr 
> cannot keep writing index files to disk while other threads merge segments 
> (since merging is a continuous process anyway).
> 
> Does anyone use the SPM monitoring tool for this type of problem? Is it of 
> any use at all?
> 
> 
> Thank you in advance.
> 
> Regards,
> Denis
> 
> 
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:
> Mikhail,
> 
> Sure, I will keep everyone posted. Moving to a non-HVM instance may take some 
> time, so hopefully I will be able to share my observations in the next couple 
> of days.
> Thanks again for all the help.
> 
> Regards,
> Denis
> 
> 
> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
> Denis, please let me know how it ends up. I'm really curious about this case 
> and the AWS instance flavours. FWIW, since 7.4 we'll have an ioThrottle=false 
> option. 
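> 
> In solrconfig.xml that should look something like this (going by SOLR-11200; 
> treat it as a sketch until 7.4 is actually out):
> 
>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>       <bool name="ioThrottle">false</bool>
>     </mergeScheduler>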
> 
> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com> wrote:
> Mikhail, Erick,
> 
> Thank you.
> 
> What just occurred to me: we don't use local SSDs; we're using EBS volumes 
> instead.
> I was looking at the wrong instance type.
> I will try to set up a cluster with SSD nodes and retest.
> 
> Regards,
> Denis
> 
> 
> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org> wrote:
> I'm not sure it's the right context, but here one guy shows a really low 
> throttle boundary: 
> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
> 
> 
> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org> wrote:
> Threads are hanging on merge I/O throttling:
>         at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>         at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>         at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>         at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
> It seems odd. Please confirm that you don't commit on every update request. 
> The only way to monitor I/O throttling is to enable infoStream and read a lot 
> of logs.
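> 
> Enabling infoStream is a one-liner in solrconfig.xml (output goes into the 
> regular Solr log and is very verbose):
> 
>     <indexConfig>
>       <infoStream>true</infoStream>
>     </indexConfig>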
>    
> 
> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:
> Erick,
> 
> Thank you for your quick response.
> 
> I/O bottleneck: please see another screenshot attached; as you can see, disk 
> r/w operations are low or insignificant.
> iostat==========
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.52    0.00    0.00    0.00    0.00   87.48
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.51    0.00    0.00    0.00    0.00   87.49
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> ==========================
> 
> Merging threads: I don't see any modifications of the merge policy compared 
> to the default solrconfig.
> Index config:
> <ramBufferSizeMB>2000</ramBufferSizeMB>
> <maxBufferedDocs>500000</maxBufferedDocs>
> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
> Could you please help me understand how I can validate this theory?
> Another note: even if I remove the load from the cluster, I still see the 
> merge threads consuming CPU for some time. This can go on for hours, and if 
> I apply the load again nothing changes.
> If this were an overloaded merge process, it should take some time to work 
> through the backlog and then start accepting new indexing requests again.
> Maybe I am wrong, but I need some help understanding how to check this.
> 
> AWS - Sorry, I don't have any physical hardware to replicate this test locally
> 
> GC - I monitored GC closely. If you take a look at the CPU utilization 
> screenshot you will see a blue graph, which is GC consumption. In addition, 
> I am using the Visual GC plugin for VisualVM to understand how GC performs 
> under load, and I don't see any anomalies.
> There are occasional GC pauses, but they are not significant. The heap 
> utilization graph tells me that GC is not struggling.
> 
> Thank you again for your comments; I hope the information above helps you 
> understand the problem.
> 
> 
> Regards,
> Denis
> 
> 
> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com> wrote:
> Have you changed any of the merge policy parameters? I doubt it but just 
> asking.
> 
> My guess: your I/O is your bottleneck. There are a limited number of
> threads (tunable) that are used for background merging. When they're
> all busy, incoming updates are queued up. This squares with your
> statement that queries are fine and CPU activity is moderate.
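> 
> For reference, those merge threads are configured on the ConcurrentMergeScheduler 
> in solrconfig.xml's <indexConfig>; a sketch with illustrative values, not 
> recommendations:
> 
>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>       <int name="maxThreadCount">4</int> <!-- simultaneous merge threads -->
>       <int name="maxMergeCount">6</int>  <!-- pending merges before indexing stalls -->
>     </mergeScheduler>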
> 
> A quick test there would be to try this on a non-AWS setup if you have
> some hardware you can repurpose.
> 
> An 80G heap is a red flag; most of the time that's far too large.
> So one thing I'd do is hook up some GC monitoring, since you may be spending
> a horrible amount of time in GC cycles.
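> 
> If nothing else, the JVM's own GC log is cheap to turn on. For OpenJDK 8
> that's something like the following (the log path is a placeholder):
> 
>     -Xloggc:/var/solr/logs/solr_gc.log \
>     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>     -XX:+PrintGCApplicationStoppedTime \
>     -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M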
> 
> Best,
> Erick
> 
> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com> wrote:
> >
> > All,
> >
> > I would like to request some assistance with the situation described below. My
> > SolrCloud cluster accepts update requests at a very low rate, making it
> > impossible to index new documents.
> >
> > Cluster Setup:
> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
> > Cluster - SolrCloud 7.2.1, 10 r4.4xlarge AWS instances, 120GB physical memory,
> > 80GB Java heap
> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment (build
> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
> >
> > Symptoms:
> > 1. 4 instances running 4 threads each use the SolrJ client to submit
> > documents to SolrCloud for indexing and do not perform any manual commits.
> > Each batch contains 10 documents, with ~200 text fields per document (a
> > sketch of the client code follows this list).
> > 2. After some time (~20-30 minutes; by then I see only ~50-60K documents in
> > the collection, and node restarts do not help) I notice that the clients can
> > no longer submit new documents to the cluster for indexing; each operation
> > takes an enormous amount of time.
> > 3. The cluster is not loaded at all: CPU consumption is moderate (although I
> > see merging happening all the time), memory consumption is adequate, but
> > updates from external clients are still not accepted.
> > 4. Search requests are handled fine.
> > 5. I don't see any significant activity in the SolrCloud logs, just regular
> > replication attempts. No errors.
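> >
> > A minimal sketch of the submitting code (SolrJ 7.x API; the ZooKeeper
> > address and collection name are placeholders):
> >
> >     // imports: org.apache.solr.client.solrj.impl.CloudSolrClient,
> >     //          org.apache.solr.common.SolrInputDocument, java.util.*
> >     CloudSolrClient client = new CloudSolrClient.Builder()
> >         .withZkHost("zk1:2181,zk2:2181,zk3:2181")
> >         .build();
> >     client.setDefaultCollection("collection1");
> >     List<SolrInputDocument> batch = new ArrayList<>();
> >     for (int i = 0; i < 10; i++) {
> >       SolrInputDocument doc = new SolrInputDocument();
> >       doc.addField("id", UUID.randomUUID().toString());
> >       // ...plus ~200 dynamic text fields per document
> >       batch.add(doc);
> >     }
> >     client.add(batch);  // no explicit commit; autoCommit handles it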
> >
> >
> > Additional information
> > 1. Please see Thread Dump attached.
> > 2. Please see SolrAdmin info with physical memory and file descriptor
> > utilization
> > 3. Please see VisualVM screenshots with CPU and memory utilization and CPU
> > profiling data. Physical memory utilization is about 60-70 percent all the
> > time.
> > 4. The schema contains ~10 permanent fields, 5 of which are mapped,
> > mandatory, and persisted; the rest of the fields are optional and dynamic.
> > 5. Solr config sets autoCommit to 2 minutes with openSearcher set to false
> > (the snippet after this list shows the equivalent config).
> > 6. Caches are set up with autoWarmCount = 0.
> > 7. GC was fine-tuned and I don't see any significant CPU utilization by GC
> > or any lengthy pauses. The majority of garbage is collected in the young
> > generation.
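> >
> > For reference, the autoCommit settings from item 5 look like this in
> > solrconfig.xml (values match the description above; a sketch, not the
> > verbatim config):
> >
> >     <autoCommit>
> >       <maxTime>120000</maxTime> <!-- 2 minutes -->
> >       <openSearcher>false</openSearcher>
> >     </autoCommit>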
> >
> > My primary question: I see that the cluster is alive and performs some
> > merging and commits, but it does not accept new documents for indexing. What
> > is causing this slowdown, and why does it not accept new submissions?
> >
> >
> > Regards,
> > Denis
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
