Zookeeper 3.4.6 is not good? That was the version recommended by Solr docs when I installed 6.2.0.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 2, 2018, at 9:30 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> Hello S.G.
>
> We have relied on Trie* fields ever since they became available, i don't
> think reverting to the old fieldTypes will do us any good, we have a very
> recent problem.
>
> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we only
> recently increased it because our data keeps on growing. Heap rarely goes
> higher than 50 %, except when this specific problem occurs. The nodes have no
> problem processing a few hundred QPS continuously and can go on for days,
> sometimes even a few weeks.
>
> I will keep my eye open for other clues when the problem strikes again!
>
> Thanks,
> Markus
>
> -----Original message-----
>> From:S G <sg.online.em...@gmail.com>
>> Sent: Friday 2nd February 2018 18:20
>> To: solr-user@lucene.apache.org
>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>
>> Yeah, definitely check the zookeeper version.
>> 3.4.6 is not a good one, I know, and you can say the same for all the
>> versions below it too.
>> We have used 3.4.9 with no issues, while Solr 7.x uses 3.4.10.
>>
>> Another dimension could be the use (or disuse) of p-fields like pint,
>> plong etc.
>> If you are using them, try to revert back to tint, tlong etc.,
>> and if you are not using them, try to use them (although doing this means a
>> change from your older config and is less likely to help).
>>
>> Lastly, did I read 2 GB for JVM heap?
>> That seems far too little to me for any version of Solr.
>> We run with 10-16 GB of heap with the G1GC collector and new-gen capped
>> at 3-4 GB.
>>
>>
>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <markus.jel...@openindex.io>
>> wrote:
>>
>>> Hello Ere,
>>>
>>> It appears that my initial e-mail [1] got lost in the thread.
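[Editor's note: S G's heap and GC description above (10-16 GB heap, G1GC, new-gen capped at 3-4 GB) would look roughly like this in solr.in.sh. This is a minimal sketch; the specific sizes and pause-time goal are illustrative assumptions, not values given in the thread.]

```shell
# Sketch of solr.in.sh settings matching S G's description.
# The 12g heap and 4g new-gen cap are illustrative assumptions.
SOLR_HEAP="12g"

# G1 with an explicit new-gen ceiling. Note that fixing the young
# generation size constrains G1's adaptive sizing, so test before adopting.
GC_TUNE="-XX:+UseG1GC -XX:MaxNewSize=4g -XX:MaxGCPauseMillis=250"
```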
>>> We don't
>>> have GC issues; the cluster that dies occasionally runs, in general,
>>> smoothly and quickly with just 2 GB allocated.
>>>
>>> Thanks,
>>> Markus
>>>
>>> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html
>>>
>>> -----Original message-----
>>>> From:Ere Maijala <ere.maij...@helsinki.fi>
>>>> Sent: Friday 2nd February 2018 8:49
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>
>>>> Markus,
>>>>
>>>> I may be stating the obvious here, but I didn't notice garbage
>>>> collection mentioned in any of the previous messages, so here goes. In
>>>> our experience almost all of the Zookeeper timeouts etc. have been
>>>> caused by too-long garbage collection pauses. I've summed up my
>>>> observations here:
>>>> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
>>>>
>>>> So, in my experience it's relatively easy to cause heavy memory usage
>>>> with SolrCloud with seemingly innocent queries, and GC can become a
>>>> problem really quickly even if everything seems to be running smoothly
>>>> otherwise.
>>>>
>>>> Regards,
>>>> Ere
>>>>
>>>> Markus Jelsma wrote on 31.1.2018 at 23.56:
>>>>> Hello S.G.
>>>>>
>>>>> We do not complain about speed improvements at all; it is clear 7.x is
>>>>> faster than its predecessor. The problem is stability and not recovering
>>>>> from weird circumstances. In general, it is our high-load cluster
>>>>> containing user interaction logs that suffers the most. Our main text
>>>>> search cluster, receiving much fewer queries, seems mostly unaffected,
>>>>> except last Sunday. After a very short but high burst of queries it
>>>>> entered the same catatonic state the logs cluster usually dies from.
>>>>>
>>>>> The query burst immediately caused ZK timeouts and high heap
>>>>> consumption (not sure which of the two came first).
>>>>> The query burst
>>>>> lasted for 30 minutes; the excessive heap consumption continued for more
>>>>> than 8 hours before Solr finally realized it could relax. Most remarkable
>>>>> was that Solr recovered on its own: ZK timeouts stopped and heap went
>>>>> back to normal.
>>>>>
>>>>> There seems to be a causality between high load and this state.
>>>>>
>>>>> We really want to get this fixed, for ourselves and everyone else who
>>>>> may encounter this problem, but i don't know how, so i need much more
>>>>> feedback and hints from those who have a deep understanding of the
>>>>> inner workings of SolrCloud and the changes since 6.x.
>>>>>
>>>>> To be clear, we don't have the problem of the 15-second ZK timeout; we
>>>>> use 30. Is 30 still too low? Is it even remotely related to this
>>>>> problem? What does load have to do with it?
>>>>>
>>>>> We are not able to reproduce it in lab environments. It can take
>>>>> minutes after cluster startup for it to occur, but also days.
>>>>>
>>>>> I've always been slightly annoyed by problems that can occur over such
>>>>> a broad time span; it is always bad luck for reproduction.
>>>>>
>>>>> Any help getting further is much appreciated.
>>>>>
>>>>> Many thanks,
>>>>> Markus
>>>>>
>>>>> -----Original message-----
>>>>>> From:S G <sg.online.em...@gmail.com>
>>>>>> Sent: Wednesday 31st January 2018 21:48
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>
>>>>>> We did some basic load testing on our 7.1.0 and 7.2.1 clusters,
>>>>>> and that came out all right.
>>>>>> We saw a performance increase of about 30% in read latencies between
>>>>>> 6.6.0 and 7.1.0,
>>>>>> and then a performance degradation of about 10% between 7.1.0 and
>>>>>> 7.2.1 in many metrics.
>>>>>> But overall, it still seems better than 6.6.0.
>>>>>>
>>>>>> I will check for the errors too in the logs, but the nodes were
>>>>>> responsive for all the 23+ hours we did the load test.
>>>>>>
>>>>>> Disclaimer: We do not test facets and pivots or block-joins, and will
>>>>>> add those features to our load-testing tool sometime this year.
>>>>>>
>>>>>> Thanks
>>>>>> SG
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Ah thanks, i just submitted a patch fixing it.
>>>>>>>
>>>>>>> Anyway, in the end it appears this is not the problem we are seeing,
>>>>>>> as our timeouts were already at 30 seconds.
>>>>>>>
>>>>>>> All i know is that at some point nodes start to lose ZK connections
>>>>>>> due to timeouts (the logs say so, but all within 30 seconds); the logs
>>>>>>> are flooded with these messages:
>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from server
>>>>>>> in 10359ms for sessionid 0x160f9e723c12122
>>>>>>> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
>>>>>>> 0x60f9e7234f05bb has expired
>>>>>>>
>>>>>>> Then there is a doubling in heap usage and nodes become unresponsive,
>>>>>>> die, etc.
>>>>>>>
>>>>>>> We also see those messages in other collections, but not so
>>>>>>> frequently, and they don't cause failure in those less loaded clusters.
>>>>>>>
>>>>>>> Ideas?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Markus
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From:Michael Braun <n3c...@gmail.com>
>>>>>>>> Sent: Monday 29th January 2018 21:09
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>>>
>>>>>>>> Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello SG,
>>>>>>>>>
>>>>>>>>> The default in solr.in.sh is commented, so it defaults to the value
>>>>>>>>> set in bin/solr, which is fifteen seconds.
>>>>>>>>> Just uncomment the setting in
>>>>>>>>> solr.in.sh and your timeout will be thirty seconds.
>>>>>>>>>
>>>>>>>>> For Solr itself to really default to thirty seconds, Solr's bin/solr
>>>>>>>>> needs to be patched to use the correct value.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Markus
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From:S G <sg.online.em...@gmail.com>
>>>>>>>>>> Sent: Monday 29th January 2018 20:15
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>>>>>
>>>>>>>>>> Hi Markus,
>>>>>>>>>>
>>>>>>>>>> We are in the process of upgrading our clusters to 7.2.1 and I am
>>>>>>>>>> not sure I quite follow the conversation here.
>>>>>>>>>> Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a
>>>>>>>>>> higher value in the config (and it's just a default value being
>>>>>>>>>> wrong/overridden somewhere)?
>>>>>>>>>> Or is it more severe, in the sense that any config set for
>>>>>>>>>> ZK_CLIENT_TIMEOUT by the user is just ignored completely by Solr
>>>>>>>>>> in 7.2.1?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> SG
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, i applied the patch and it is clear the timeout is 15000.
>>>>>>>>>>> Solr.xml says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by
>>>>>>>>>>> default unset in solr.in.sh, but set in bin/solr to 15000. So it
>>>>>>>>>>> seems Solr's default is still 15000, not 30000.
>>>>>>>>>>>
>>>>>>>>>>> But, back to my topic. I see we explicitly set it in solr.in.sh
>>>>>>>>>>> to 30000. To be sure, i applied your patch to a production
>>>>>>>>>>> machine; all our collections run with 30000. So how would that
>>>>>>>>>>> explain this log line?
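[Editor's note: the solr.in.sh behavior Markus describes above can be sketched as follows. This assumes the stock Solr 6/7-era solr.in.sh; the 15000 ms fallback lives in bin/solr, as the thread states.]

```shell
# As shipped, solr.in.sh has the setting commented out, so bin/solr's
# hard-coded fallback of 15000 ms is what actually takes effect:
#ZK_CLIENT_TIMEOUT="30000"

# Uncommenting it makes the 30-second timeout effective:
ZK_CLIENT_TIMEOUT="30000"
```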
>>>>>>>>>>>
>>>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from
>>>>>>>>>>> server in 22130ms
>>>>>>>>>>>
>>>>>>>>>>> We also see these with smaller values, seven seconds. And is this
>>>>>>>>>>> actually an indicator of the problems we have?
>>>>>>>>>>>
>>>>>>>>>>> Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> Many thanks,
>>>>>>>>>>> Markus
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>> From:Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>>>>>> Sent: Saturday 27th January 2018 10:03
>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>> Subject: RE: 7.2.1 cluster dies within minutes after restart
>>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I grepped for it yesterday and found nothing but 30000 in the
>>>>>>>>>>>> settings, but judging from the weird timeout value, you may be
>>>>>>>>>>>> right. Let me apply your patch early next week and check for
>>>>>>>>>>>> spurious warnings.
>>>>>>>>>>>>
>>>>>>>>>>>> Another noteworthy observation for those working on cloud
>>>>>>>>>>>> stability and recovery: whenever this happens, some nodes are
>>>>>>>>>>>> also absolutely sure to run OOM. The leaders usually live
>>>>>>>>>>>> longest; the replicas don't, their heap usage peaks every time,
>>>>>>>>>>>> consistently.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Markus
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>>> From:Shawn Heisey <apa...@elyograg.org>
>>>>>>>>>>>>> Sent: Saturday 27th January 2018 0:49
>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
>>>>>>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from
>>>>>>>>>>>>>> server in 22130ms (although zkClientTimeOut is 30000).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you absolutely certain that there is a setting for
>>>>>>>>>>>>> zkClientTimeout that is actually getting applied?  The default
>>>>>>>>>>>>> value in Solr's example configs is 30 seconds, but the internal
>>>>>>>>>>>>> default in the code (when no configuration is found) is still
>>>>>>>>>>>>> 15.  I have confirmed this in the code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like SolrCloud doesn't log the values it's using for
>>>>>>>>>>>>> things like zkClientTimeout.  I think it should.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-11915
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>
>>>>
>>>> --
>>>> Ere Maijala
>>>> Kansalliskirjasto / The National Library of Finland
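[Editor's note: the resolution chain discussed in this thread runs through solr.xml. A sketch of the relevant fragment, assuming a stock Solr 7.x solr.xml; per the thread, the 30000 fallback below only applies when the zkClientTimeout system property is absent, and bin/solr normally passes that property from ZK_CLIENT_TIMEOUT, defaulting it to 15000 ms, so the property usually wins.]

```xml
<!-- Fragment of solr.xml (sketch). The 30000 ms here is only a fallback
     for when no zkClientTimeout system property is supplied at startup. -->
<solrcloud>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>
```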