Re: 7.2.1 cluster dies within minutes after restart

Walter Underwood Fri, 02 Feb 2018 09:39:06 -0800

Zookeeper 3.4.6 is not good? That was the version recommended by Solr docs when 
I installed 6.2.0.


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 2, 2018, at 9:30 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> Hello S.G.
> 
> We have relied on Trie* fields ever since they became available. I don't
> think reverting to the old fieldTypes will do us any good; we have a very
> recent problem.
> 
> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we only 
> recently increased it because our data keeps on growing. Heap rarely goes
> higher than 50 %, except when this specific problem occurs. The nodes have no 
> problem processing a few hundred QPS continuously and can go on for days, 
> sometimes even a few weeks.
> 
> I will keep my eye open for other clues when the problem strikes again!
> 
> Thanks,
> Markus
> 
> -----Original message-----
>> From:S G <sg.online.em...@gmail.com>
>> Sent: Friday 2nd February 2018 18:20
>> To: solr-user@lucene.apache.org
>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>> 
>> Yeah, definitely check the ZooKeeper version.
>> 3.4.6 is not a good one, I know, and you can say the same for all the
>> versions below it too.
>> We have used 3.4.9 with no issues,
>> while Solr 7.x uses 3.4.10.
>> 
>> Another dimension could be the use (or disuse) of p-fields like pint,
>> plong etc.
>> If you are using them, try reverting to tint, tlong etc.
>> And if you are not using them, try using them (although doing this means a
>> change from your older config and is less likely to help).
>> 
>> Lastly, did I read 2 GB for JVM heap?
>> That seems far too little to me for any version of Solr.
>> We run with 10-16 GB of heap with the G1GC collector and new-gen capped at
>> 3-4 GB.
>> 
>> 
>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <markus.jel...@openindex.io>
>> wrote:
>> 
>>> Hello Ere,
>>> 
>>> It appears that my initial e-mail [1] got lost in the thread. We don't
>>> have GC issues; the cluster that dies occasionally runs, in general,
>>> smoothly and quickly with just 2 GB allocated.
>>> 
>>> Thanks,
>>> Markus
>>> 
>>> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html
>>> 
>>> -----Original message-----
>>>> From:Ere Maijala <ere.maij...@helsinki.fi>
>>>> Sent: Friday 2nd February 2018 8:49
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>> 
>>>> Markus,
>>>> 
>>>> I may be stating the obvious here, but I didn't notice garbage
>>>> collection mentioned in any of the previous messages, so here goes. In
>>>> our experience almost all of the Zookeeper timeouts etc. have been
>>>> caused by too long garbage collection pauses. I've summed up my
>>>> observations here:
>>>> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
>>>> 
>>>> 
>>>> So, in my experience it's relatively easy to cause heavy memory usage
>>>> with SolrCloud with seemingly innocent queries, and GC can become a
>>>> problem really quickly even if everything seems to be running smoothly
>>>> otherwise.
>>>> 
>>>> Regards,
>>>> Ere
>>>> 
>>>> Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
>>>>> Hello S.G.
>>>>> 
>>>>> We do not complain about speed improvements at all, it is clear 7.x is
>>>>> faster than its predecessor. The problem is stability and not recovering
>>>>> from weird circumstances. In general, it is our high load cluster
>>>>> containing user interaction logs that suffers the most. Our main text
>>>>> search cluster - receiving far fewer queries - seems mostly unaffected,
>>>>> except last Sunday. After a very short but intense burst of queries it
>>>>> entered the same catatonic state the logs cluster usually dies from.
>>>>> 
>>>>> The query burst immediately caused ZK timeouts and high heap
>>>>> consumption (not sure which of the two came first). The query burst
>>>>> lasted for 30 minutes, the excessive heap consumption continued for more
>>>>> than 8 hours, before Solr finally realized it could relax. Most remarkable
>>>>> was that Solr recovered on its own, ZK timeouts stopped, heap went back to
>>>>> normal.
>>>>> 
>>>>> There seems to be a causality between high load and this state.
>>>>> 
>>>>> We really want to get this fixed for ourselves and everyone else who
>>>>> may encounter this problem, but I don't know how, so I need much more
>>>>> feedback and hints from those who have a deep understanding of the inner
>>>>> workings of SolrCloud and the changes since 6.x.
>>>>> 
>>>>> To be clear, we don't have the problem of the 15 second ZK timeout, we
>>>>> use 30. Is 30 still too low? Is it even remotely related to this problem?
>>>>> What does load have to do with it?
>>>>> 
>>>>> We are not able to reproduce it in lab environments. It can take
>>>>> minutes after cluster startup for it to occur, but also days.
>>>>> 
>>>>> I've been slightly annoyed by problems that can occur in a broad time
>>>>> span, it is always bad luck for reproduction.
>>>>> 
>>>>> Any help getting further is much appreciated.
>>>>> 
>>>>> Many thanks,
>>>>> Markus
>>>>> 
>>>>> -----Original message-----
>>>>>> From:S G <sg.online.em...@gmail.com>
>>>>>> Sent: Wednesday 31st January 2018 21:48
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>> 
>>>>>> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
>>>>>> And that came out all right.
>>>>>> We saw a performance increase of about 30% in read latencies between
>>>>>> 6.6.0 and 7.1.0.
>>>>>> And then we saw a performance degradation of about 10% between 7.1.0
>>>>>> and 7.2.1 in many metrics.
>>>>>> But overall, it still seems better than 6.6.0.
>>>>>> 
>>>>>> I will check for the errors too in the logs but the nodes were
>>>>>> responsive for all the 23+ hours we did the load test.
>>>>>> 
>>>>>> Disclaimer: We do not test facets and pivots or block-joins. We will
>>>>>> add those features to our load-testing tool sometime this year.
>>>>>> 
>>>>>> Thanks
>>>>>> SG
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma
>>>>>> <markus.jel...@openindex.io> wrote:
>>>>>> 
>>>>>>> Ah thanks, I just submitted a patch fixing it.
>>>>>>> 
>>>>>>> Anyway, in the end it appears this is not the problem we are seeing,
>>>>>>> as our timeouts were already at 30 seconds.
>>>>>>> 
>>>>>>> All I know is that at some point nodes start to lose ZK connections
>>>>>>> due to timeouts (logs say so, but all within 30 seconds), the logs are
>>>>>>> flooded with those messages:
>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from server in 10359ms for sessionid 0x160f9e723c12122
>>>>>>> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session 0x60f9e7234f05bb has expired
>>>>>>> 
>>>>>>> Then there is a doubling in heap usage and nodes become
>>>>>>> unresponsive, die etc.
>>>>>>> 
>>>>>>> We also see those messages in other collections, but not so
>>>>>>> frequently and they don't cause failure in those less loaded clusters.
>>>>>>> 
>>>>>>> Ideas?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Markus
>>>>>>> 
>>>>>>> -----Original message-----
>>>>>>>> From:Michael Braun <n3c...@gmail.com>
>>>>>>>> Sent: Monday 29th January 2018 21:09
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>>> 
>>>>>>>> Believe this is reported in
>>>>>>>> https://issues.apache.org/jira/browse/SOLR-10471
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma
>>>>>>>> <markus.jel...@openindex.io> wrote:
>>>>>>>> 
>>>>>>>>> Hello SG,
>>>>>>>>> 
>>>>>>>>> The default in solr.in.sh is commented out, so it defaults to the value
>>>>>>>>> set in bin/solr, which is fifteen seconds. Just uncomment the setting in
>>>>>>>>> solr.in.sh and your timeout will be thirty seconds.
>>>>>>>>> 
>>>>>>>>> For Solr itself to really default to thirty seconds, Solr's bin/solr
>>>>>>>>> needs to be patched to use the correct value.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Markus
>>>>>>>>> 
>>>>>>>>> -----Original message-----
>>>>>>>>>> From:S G <sg.online.em...@gmail.com>
>>>>>>>>>> Sent: Monday 29th January 2018 20:15
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>>>>> 
>>>>>>>>>> Hi Markus,
>>>>>>>>>> 
>>>>>>>>>> We are in the process of upgrading our clusters to 7.2.1 and I am not
>>>>>>>>>> sure I quite follow the conversation here.
>>>>>>>>>> Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
>>>>>>>>>> value in the config (and it's just a default value being
>>>>>>>>>> wrong/overridden somewhere)?
>>>>>>>>>> Or is it more severe in the sense that any config set for
>>>>>>>>>> ZK_CLIENT_TIMEOUT by the user is just ignored completely by Solr in
>>>>>>>>>> 7.2.1?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> SG
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma
>>>>>>>>>> <markus.jel...@openindex.io> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Ok, I applied the patch and it is clear the timeout is 15000. solr.xml
>>>>>>>>>>> says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset
>>>>>>>>>>> in solr.in.sh, but set in bin/solr to 15000. So it seems Solr's
>>>>>>>>>>> default is still 15000, not 30000.
>>>>>>>>>>> 
>>>>>>>>>>> But, back to my topic. I see we explicitly set it in solr.in.sh to
>>>>>>>>>>> 30000. To be sure, I applied your patch to a production machine, all
>>>>>>>>>>> our collections run with 30000. So how would that explain this log
>>>>>>>>>>> line?
>>>>>>>>>>> 
>>>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from server in 22130ms
>>>>>>>>>>> 
>>>>>>>>>>> We also see these with smaller values, seven seconds. And, is this
>>>>>>>>>>> actually an indicator of the problems we have?
>>>>>>>>>>> 
>>>>>>>>>>> Any ideas?
>>>>>>>>>>> 
>>>>>>>>>>> Many thanks,
>>>>>>>>>>> Markus
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>> From:Markus Jelsma <markus.jel...@openindex.io>
>>>>>>>>>>>> Sent: Saturday 27th January 2018 10:03
>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>> Subject: RE: 7.2.1 cluster dies within minutes after restart
>>>>>>>>>>>> 
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> 
>>>>>>>>>>>> I grepped for it yesterday and found nothing but 30000 in the
>>>>>>>>>>>> settings, but judging from the weird timeout value, you may be right.
>>>>>>>>>>>> Let me apply your patch early next week and check for spurious
>>>>>>>>>>>> warnings.
>>>>>>>>>>>> 
>>>>>>>>>>>> Another noteworthy observation for those working on cloud stability
>>>>>>>>>>>> and recovery: whenever this happens, some nodes are also absolutely
>>>>>>>>>>>> sure to run OOM. The leaders usually live longest, the replicas
>>>>>>>>>>>> don't; their heap usage peaks every time, consistently.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Markus
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>>> From:Shawn Heisey <apa...@elyograg.org>
>>>>>>>>>>>>> Sent: Saturday 27th January 2018 0:49
>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
>>>>>>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from server in 22130ms (although zkClientTimeOut is 30000).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Are you absolutely certain that there is a setting for
>>>>>>>>>>>>> zkClientTimeout that is actually getting applied?  The default value
>>>>>>>>>>>>> in Solr's example configs is 30 seconds, but the internal default in
>>>>>>>>>>>>> the code (when no configuration is found) is still 15.  I have
>>>>>>>>>>>>> confirmed this in the code.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Looks like SolrCloud doesn't log the values it's using for things
>>>>>>>>>>>>> like zkClientTimeout.  I think it should.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-11915
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> --
>>>> Ere Maijala
>>>> Kansalliskirjasto / The National Library of Finland
>>>> 
>>> 
>>
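
The timeout precedence described in the quoted messages (ZK_CLIENT_TIMEOUT
unset in solr.in.sh, bin/solr falling back to 15000, solr.xml falling back
to 30000) can be sketched roughly as below. This is only a small Python
model of what the thread reports, not Solr's actual code: the function name
and the started_via_bin_solr flag are made up for illustration, and the two
constants are simply the values quoted above.

```python
from typing import Mapping

# Values as reported in the thread; treat them as assumptions, not as
# something read from a particular Solr release.
BIN_SOLR_FALLBACK_MS = 15000  # bin/solr default when ZK_CLIENT_TIMEOUT is unset
SOLR_XML_FALLBACK_MS = 30000  # solr.xml default when zkClientTimeout is unset


def effective_zk_client_timeout(solr_in_sh: Mapping[str, str],
                                started_via_bin_solr: bool = True) -> int:
    """Hypothetical model of which ZK client timeout (ms) ends up in effect."""
    configured = solr_in_sh.get("ZK_CLIENT_TIMEOUT")
    if configured:
        # An uncommented setting in solr.in.sh wins.
        return int(configured)
    if started_via_bin_solr:
        # bin/solr fills in its own default and hands it to Solr, so the
        # fallback written in solr.xml is never reached.
        return BIN_SOLR_FALLBACK_MS
    # Only without the start script does the solr.xml fallback apply.
    return SOLR_XML_FALLBACK_MS


if __name__ == "__main__":
    print(effective_zk_client_timeout({}))
    print(effective_zk_client_timeout({"ZK_CLIENT_TIMEOUT": "30000"}))
```

Under these assumptions, leaving the setting commented out gives the
bin/solr value, and uncommenting it in solr.in.sh gives whatever was
configured there, which matches the workaround suggested in the thread.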
