I am happy to report that <1> fixed these: PERFORMANCE WARNING: Overlapping onDeckSearchers=2
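For anyone finding this thread later, <1> refers to ignoring explicit client commits with IgnoreCommitOptimizeUpdateProcessorFactory. Below is a minimal sketch of what that can look like in solrconfig.xml. It is modelled on the Solr reference guide, not copied from the cluster discussed here; the chain name, the statusCode and the autoCommit/autoSoftCommit intervals are illustrative assumptions to adapt.

```xml
<!-- Sketch only: chain name and statusCode are assumptions, not values from this thread. -->
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <!-- Swallow explicit commit/optimize requests from clients and answer with HTTP 200 -->
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Goes inside <updateHandler>. Once client commits are ignored, Solr must commit on its own. -->
<!-- The intervals below are placeholders; pick the longest soft commit interval you can stand. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>
```

With this in place a new searcher is opened only on the soft commit interval, so searchers should no longer pile up the way the onDeckSearchers warning indicates.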
We still occasionally see timeouts so we may have to explore <2>.

On Thu, Oct 26, 2017 at 12:12 PM, Fengtan <fengtan...@gmail.com> wrote:

> Thanks Erick and Emir -- we are going to start with <1> and possibly <2>.
>
> On Thu, Oct 26, 2017 at 7:06 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>
>> Hi Fengtan,
>> I would just add that when merging collections, you might want to use
>> document routing
>> (https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting)
>> - since you are keeping separate collections, I guess you have a
>> “collection ID” to use as routing key. This will enable you to have one
>> collection but query only shard(s) with data from one “collection”.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>> > On 25 Oct 2017, at 19:25, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> > <1> It's not that the explicit commits are expensive, it's that they happen
>> > too fast. An explicit commit and an internal autocommit have exactly
>> > the same cost. Your "overlapping ondeck searchers" is definitely an
>> > indication that your commits are happening from somewhere too quickly
>> > and are piling up.
>> >
>> > <2> Likely a good thing, each collection increases overhead. And
>> > 1,000,000 documents is quite small in Solr's terms unless the
>> > individual documents are enormous. I'd do this for a number of
>> > reasons.
>> >
>> > <3> Certainly an option, but I'd put that last. Fix the commit problem
>> > first ;)
>> >
>> > <4> If you do this, make the autowarm count quite small. That said,
>> > this will be very little use if you have frequent commits. Let's say
>> > you commit every second.
>> > The autowarming will warm caches, which will
>> > then be thrown out a second later. And will increase the time it takes
>> > to open a new searcher.
>> >
>> > <5> Yeah, this would probably just be a band-aid.
>> >
>> > If I were prioritizing these, I'd do
>> > <1> first. If you control the client, just don't call commit. If you
>> > do not control the client, then what you've outlined is fine. Tip: set
>> > your soft commit settings to be as long as you can stand. If you must
>> > have very short intervals, consider disabling your caches completely.
>> > Here's a long article on commits....
>> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>> >
>> > <2> Actually, this and <1> are pretty close in priority.
>> >
>> > Then re-evaluate. Fixing the commit issue may buy you quite a bit of
>> > time. Having 1,000 collections is pushing the boundaries presently.
>> > Each collection will establish watchers on the bits it cares about in
>> > ZooKeeper, and reducing the watchers by a factor approaching 1,000 is
>> > A Good Thing.
>> >
>> > Frankly, between these two things I'd pretty much expect your problems
>> > to disappear. Wouldn't be the first time I've been totally wrong, but
>> > it's where I'd start ;)
>> >
>> > Best,
>> > Erick
>> >
>> > On Wed, Oct 25, 2017 at 8:54 AM, Fengtan <fengtan...@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs.
>> >> Each VM runs RHEL 7 with 16 GB RAM and 8 CPU and OpenJDK 1.8.0_131; each
>> >> VM has one Solr and one ZK instance.
>> >> The cluster hosts 1,000 collections; each collection has 1 shard and
>> >> between 500 and 50,000 documents.
>> >> Documents are indexed incrementally every day; the Solr client mostly does
>> >> searching.
>> >> Solr runs with -Xms7g -Xmx7g.
>> >>
>> >> Everything has been working fine for about one month but a few days ago we
>> >> started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
>> >>
>> >> Also we have always seen these:
>> >> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>> >>
>> >> We are not sure what is causing the timeouts, although we have identified a
>> >> few things that could be improved:
>> >>
>> >> 1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
>> >> -- we are aware that explicit commits are expensive
>> >>
>> >> 2) Drop the 1,000 collections and use a single one instead (all our
>> >> collections use the same schema/solrconfig.xml) since stability problems
>> >> are expected when the number of collections reaches the low hundreds
>> >> <https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>. The
>> >> downside is that the new collection would contain 1,000,000 documents which
>> >> may bring new challenges.
>> >>
>> >> 3) Tune the GC and possibly switch from CMS to G1 as it seems to bring
>> >> better performance according to this
>> >> <https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>,
>> >> this
>> >> <https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector>
>> >> and this
>> >> <http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>.
>> >> The downside is that Lucene explicitly discourages the usage of G1
>> >> <https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>
>> >> so we are not sure what to expect.
>> >> We use the default GC settings:
>> >> -XX:NewRatio=3
>> >> -XX:SurvivorRatio=4
>> >> -XX:TargetSurvivorRatio=90
>> >> -XX:MaxTenuringThreshold=8
>> >> -XX:+UseConcMarkSweepGC
>> >> -XX:+UseParNewGC
>> >> -XX:ConcGCThreads=4
>> >> -XX:ParallelGCThreads=4
>> >> -XX:+CMSScavengeBeforeRemark
>> >> -XX:PretenureSizeThreshold=64m
>> >> -XX:+UseCMSInitiatingOccupancyOnly
>> >> -XX:CMSInitiatingOccupancyFraction=50
>> >> -XX:CMSMaxAbortablePrecleanTime=6000
>> >> -XX:+CMSParallelRemarkEnabled
>> >> -XX:+ParallelRefProcEnabled
>> >>
>> >> 4) Tune the caches, possibly by increasing autowarmCount on filterCache --
>> >> our current config is:
>> >> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
>> >> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
>> >> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
>> >>
>> >> 5) Tweak the timeout settings, although this would not fix the underlying
>> >> issue
>> >>
>> >> Do any of these options seem relevant? Is there anything else that might
>> >> address the timeouts?
>> >>
>> >> Thanks