I am happy to report that <1> fixed these:
  PERFORMANCE WARNING: Overlapping onDeckSearchers=2

We still occasionally see timeouts, so we may have to explore <2>.

On Thu, Oct 26, 2017 at 12:12 PM, Fengtan <fengtan...@gmail.com> wrote:

> Thanks Erick and Emir -- we are going to start with <1> and possibly <2>.
>
> On Thu, Oct 26, 2017 at 7:06 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>
>> Hi Fengtan,
>> I would just add that when merging collections, you might want to use
>> document routing
>> (https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting)
>> - since you are keeping separate collections, I guess you have a
>> “collection ID” to use as the routing key. This will let you keep one
>> collection but query only the shard(s) holding the data of one
>> “collection”, as in the sketch below.
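>>
>> A minimal sketch with the default compositeId router (the key, IDs and
>> collection name below are made up): prefix each document ID with its
>> routing key at index time, e.g.
>>
>>   id = "coll42!doc123"
>>
>> then pass the same key at query time so only the matching shard(s) are
>> queried:
>>
>>   /solr/onebigcollection/select?q=foo&_route_=coll42!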
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 25 Oct 2017, at 19:25, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> > <1> It's not that the explicit commits are expensive, it's that they
>> > happen too fast. An explicit commit and an internal autocommit have
>> > exactly the same cost. Your "overlapping onDeckSearchers" is definitely
>> > an indication that commits are coming from somewhere too quickly and
>> > are piling up.
>> >
>> > <2> Likely a good thing; each collection increases overhead. And
>> > 1,000,000 documents is quite small in Solr's terms unless the
>> > individual documents are enormous. I'd do this for a number of
>> > reasons.
>> >
>> > <3> Certainly an option, but I'd put that last. Fix the commit
>> > problem first ;)
>> >
>> > <4> If you do this, make the autowarm count quite small. That said,
>> > it will be of very little use if you have frequent commits. Say you
>> > commit every second: the autowarming will warm caches, which will
>> > then be thrown out a second later, while also increasing the time it
>> > takes to open a new searcher.
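>> >
>> > For instance, "quite small" might look like the following (16 is an
>> > illustrative value, not a recommendation):
>> >
>> >   <filterCache class="solr.FastLRUCache" size="512"
>> >                initialSize="512" autowarmCount="16"/>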
>> >
>> > <5> Yeah, this would probably just be a band-aid.
>> >
>> > If I were prioritizing these, I'd do
>> > <1> first. If you control the client, just don't call commit. If you
>> > do not control the client, then what you've outlined is fine. Tip: set
>> > your soft commit settings to be as long as you can stand. If you must
>> > have very short intervals, consider disabling your caches completely.
>> > Here's a long article on commits....
>> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
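>> >
>> > As a sketch, "as long as you can stand" might translate into something
>> > like this in the <updateHandler> section of solrconfig.xml (the
>> > intervals are placeholders; pick the longest your users will
>> > tolerate):
>> >
>> >   <autoCommit>
>> >     <maxTime>60000</maxTime>            <!-- hard commit every 60s -->
>> >     <openSearcher>false</openSearcher>  <!-- without opening a searcher -->
>> >   </autoCommit>
>> >   <autoSoftCommit>
>> >     <maxTime>300000</maxTime>           <!-- new searcher at most every 5 min -->
>> >   </autoSoftCommit>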
>> >
>> > <2> Actually, this and <1> are pretty close in priority.
>> >
>> > Then re-evaluate. Fixing the commit issue may buy you quite a bit of
>> > time. Having 1,000 collections is pushing the boundaries presently.
>> > Each collection will establish watchers on the bits it cares about in
>> > ZooKeeper, and reducing the watchers by a factor approaching 1,000 is
>> > A Good Thing.
>> >
>> > Frankly, between these two things I'd pretty much expect your problems
>> > to disappear. Wouldn't be the first time I've been totally wrong, but
>> > it's where I'd start ;)
>> >
>> > Best,
>> > Erick
>> >
>> > On Wed, Oct 25, 2017 at 8:54 AM, Fengtan <fengtan...@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs.
>> >> Each VM runs RHEL 7 with 16 GB RAM, 8 CPUs and OpenJDK 1.8.0_131;
>> >> each VM has one Solr and one ZK instance.
>> >> The cluster hosts 1,000 collections; each collection has 1 shard and
>> >> between 500 and 50,000 documents.
>> >> Documents are indexed incrementally every day; the Solr client mostly
>> >> does searching.
>> >> Solr runs with -Xms7g -Xmx7g.
>> >>
>> >> Everything had been working fine for about one month, but a few days
>> >> ago we started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
>> >>
>> >> Also, we have always seen these:
>> >>  PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>> >>
>> >>
>> >> We are not sure what is causing the timeouts, although we have
>> >> identified a few things that could be improved:
>> >>
>> >> 1) Ignore explicit commits using
>> >> IgnoreCommitOptimizeUpdateProcessorFactory -- we are aware that
>> >> explicit commits are expensive
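>> >>
>> >> A minimal sketch of what this could look like in solrconfig.xml
>> >> (untested; the chain name is arbitrary):
>> >>
>> >>  <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
>> >>    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
>> >>      <!-- answer ignored commits with 200 instead of an error -->
>> >>      <int name="statusCode">200</int>
>> >>    </processor>
>> >>    <processor class="solr.LogUpdateProcessorFactory"/>
>> >>    <processor class="solr.DistributedUpdateProcessorFactory"/>
>> >>    <processor class="solr.RunUpdateProcessorFactory"/>
>> >>  </updateRequestProcessorChain>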
>> >>
>> >> 2) Drop the 1,000 collections and use a single one instead (all our
>> >> collections use the same schema/solrconfig.xml) since stability
>> >> problems are expected when the number of collections reaches the low
>> >> hundreds <https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>.
>> >> The downside is that the new collection would contain 1,000,000
>> >> documents, which may bring new challenges.
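>> >>
>> >> (With a single merged collection we would presumably scope each query
>> >> with a filter on some collection identifier field; the field and
>> >> collection names below are hypothetical:
>> >>
>> >>   /solr/merged/select?q=foo&fq=collection_id:site42
>> >>
>> >> Since fq clauses are cached in the filterCache, this scoping should
>> >> stay cheap.)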
>> >>
>> >> 3) Tune the GC and possibly switch from CMS to G1 as it seems to
>> >> bring better performance according to this
>> >> <https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>,
>> >> this
>> >> <https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector>
>> >> and this
>> >> <http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>.
>> >> The downside is that Lucene explicitly discourages the usage of G1
>> >> <https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>,
>> >> so we are not sure what to expect. We use the default GC settings:
>> >>  -XX:NewRatio=3
>> >>  -XX:SurvivorRatio=4
>> >>  -XX:TargetSurvivorRatio=90
>> >>  -XX:MaxTenuringThreshold=8
>> >>  -XX:+UseConcMarkSweepGC
>> >>  -XX:+UseParNewGC
>> >>  -XX:ConcGCThreads=4
>> >>  -XX:ParallelGCThreads=4
>> >>  -XX:+CMSScavengeBeforeRemark
>> >>  -XX:PretenureSizeThreshold=64m
>> >>  -XX:+UseCMSInitiatingOccupancyOnly
>> >>  -XX:CMSInitiatingOccupancyFraction=50
>> >>  -XX:CMSMaxAbortablePrecleanTime=6000
>> >>  -XX:+CMSParallelRemarkEnabled
>> >>  -XX:+ParallelRefProcEnabled
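>> >>
>> >> If we did try G1, an illustrative starting point might be the
>> >> following (values are assumptions to experiment with, not tested
>> >> recommendations):
>> >>  -XX:+UseG1GC
>> >>  -XX:MaxGCPauseMillis=250
>> >>  -XX:+ParallelRefProcEnabled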
>> >>
>> >> 4) Tune the caches, possibly by increasing autowarmCount on
>> >> filterCache -- our current config is:
>> >>  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
>> >>  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
>> >>  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
>> >>
>> >> 5) Tweak the timeout settings, although this would not fix the
>> >> underlying issue
>> >>
>> >>
>> >> Do any of these options seem relevant? Is there anything else that
>> >> might address the timeouts?
>> >>
>> >> Thanks
>>
>>
>
