Re: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread S G
Our 3.4.6 ZK nodes were unable to rejoin the cluster unless their quorum got
broken.
So if there was a 5-node ZooKeeper ensemble and it lost 2 nodes, those nodes
would not rejoin because ZK still had its quorum.
To make them join, you had to break the quorum by restarting a node that was
still in the quorum.
Only when the quorum broke did ZK realize that something was wrong and
recognize the other two nodes trying to rejoin.
Also, this problem happened only when ZK had been running for a long time,
like several weeks (perhaps DNS caching or something, not sure really).
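For what it's worth, a quick way to see which nodes ZooKeeper still counts in the
quorum is the srvr four-letter command; a minimal check along these lines
(hostnames and port are placeholders) prints each node's role:

    for h in zk1 zk2 zk3 zk4 zk5; do
      echo "== $h =="
      # "Mode: leader" or "Mode: follower" means the node is serving in the quorum;
      # a refused connection or an error reply means it is not.
      echo srvr | nc "$h" 2181 | grep -E 'Zookeeper version|Mode'
    done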


On Fri, Feb 2, 2018 at 11:32 AM, Tomas Fernandez Lobbe <tflo...@apple.com>
wrote:

> Hi Markus,
> If the same code that runs OK in 7.1 breaks 7.2.1, it is clear to me that
> there is some bug in Solr introduced between those releases (maybe an
> increase in memory utilization? or maybe some decrease in query throughput
> making threads to pile up?). I’d hate to have this issue lost in the users
> list, could you create a Jira? Maybe next time you have this issue you can
> post thread/heap dumps, that would be useful.
>
> Tomás
>
> > On Feb 2, 2018, at 9:38 AM, Walter Underwood <wun...@wunderwood.org>
> wrote:
> >
> > Zookeeper 3.4.6 is not good? That was the version recommended by Solr
> docs when I installed 6.2.0.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Feb 2, 2018, at 9:30 AM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> >>
> >> Hello S.G.
> >>
> >> We have relied in Trie* fields every since they became available, i
> don't think reverting to the old fieldType's will do us any good, we have a
> very recent problem.
> >>
> >> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we
> only recently increased it because or data keeps on growing. Heap rarely
> goes higher than 50 %, except when this specific problem occurs. The nodes
> have no problem processing a few hundred QPS continuously and can go on for
> days, sometimes even a few weeks.
> >>
> >> I will keep my eye open for other clues when the problem strikes again!
> >>
> >> Thanks,
> >> Markus
> >>
> >> -Original message-
> >>> From:S G <sg.online.em...@gmail.com>
> >>> Sent: Friday 2nd February 2018 18:20
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>
> >>> Yeah, definitely check the zookeeper version.
> >>> 3.4.6 is not a good one I know and you can say the same for all the
> >>> versions below it too.
> >>> We have used 3.4.9 with no issues.
> >>> While Solr 7.x uses 3.4.10
> >>>
> >>> Another dimension could be the use or (dis-use) of p-fields like pint,
> >>> plong etc.
> >>> If you are using them, try to revert back to tint, tlong etc
> >>> And if you are not using them, try to use them (Although doing this
> means a
> >>> change from your older config and less likely to help).
> >>>
> >>> Lastly, did I read 2 GB for JVM heap?
> >>> That seems really too less to me for any version of Solr
> >>> We run with 10-16 gb of heap with G1GC collector and new-gen capped at
> 3-4gb
> >>>
> >>>
> >>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> >>> wrote:
> >>>
> >>>> Hello Ere,
> >>>>
> >>>> It appears that my initial e-mail [1] got lost in the thread. We don't
> >>>> have GC issues, the cluster that dies occasionally runs, in general,
> smooth
> >>>> and quick with just 2 GB allocated.
> >>>>
> >>>> Thanks,
> >>>> Markus
> >>>>
> >>>> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-
> >>>> within-minutes-after-restart-td4372615.html
> >>>>
> >>>> -Original message-
> >>>>> From:Ere Maijala <ere.maij...@helsinki.fi>
> >>>>> Sent: Friday 2nd February 2018 8:49
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>
> >>>>> Markus,
> >>>>>
> >>>>> I may be stating the obvious here, but I didn't notice garbage
> >>>>> collection mentioned in any of the previous messages, so here goes.
> In

Re: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread Tomas Fernandez Lobbe
Hi Markus, 
If the same code that runs OK in 7.1 breaks in 7.2.1, it is clear to me that there 
is some bug in Solr introduced between those releases (maybe an increase in 
memory utilization? Or maybe some decrease in query throughput making threads 
pile up?). I’d hate to have this issue lost in the users list; could you 
create a Jira? Maybe next time you have this issue you can post thread/heap 
dumps; that would be useful.

Tomás

> On Feb 2, 2018, at 9:38 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> Zookeeper 3.4.6 is not good? That was the version recommended by Solr docs 
> when I installed 6.2.0.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 2, 2018, at 9:30 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> 
>> Hello S.G.
>> 
>> We have relied in Trie* fields every since they became available, i don't 
>> think reverting to the old fieldType's will do us any good, we have a very 
>> recent problem.
>> 
>> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we only 
>> recently increased it because or data keeps on growing. Heap rarely goes 
>> higher than 50 %, except when this specific problem occurs. The nodes have 
>> no problem processing a few hundred QPS continuously and can go on for days, 
>> sometimes even a few weeks.
>> 
>> I will keep my eye open for other clues when the problem strikes again!
>> 
>> Thanks,
>> Markus
>> 
>> -Original message-
>>> From:S G <sg.online.em...@gmail.com>
>>> Sent: Friday 2nd February 2018 18:20
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>> 
>>> Yeah, definitely check the zookeeper version.
>>> 3.4.6 is not a good one I know and you can say the same for all the
>>> versions below it too.
>>> We have used 3.4.9 with no issues.
>>> While Solr 7.x uses 3.4.10
>>> 
>>> Another dimension could be the use or (dis-use) of p-fields like pint,
>>> plong etc.
>>> If you are using them, try to revert back to tint, tlong etc
>>> And if you are not using them, try to use them (Although doing this means a
>>> change from your older config and less likely to help).
>>> 
>>> Lastly, did I read 2 GB for JVM heap?
>>> That seems really too less to me for any version of Solr
>>> We run with 10-16 gb of heap with G1GC collector and new-gen capped at 3-4gb
>>> 
>>> 
>>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <markus.jel...@openindex.io>
>>> wrote:
>>> 
>>>> Hello Ere,
>>>> 
>>>> It appears that my initial e-mail [1] got lost in the thread. We don't
>>>> have GC issues, the cluster that dies occasionally runs, in general, smooth
>>>> and quick with just 2 GB allocated.
>>>> 
>>>> Thanks,
>>>> Markus
>>>> 
>>>> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-
>>>> within-minutes-after-restart-td4372615.html
>>>> 
>>>> -Original message-
>>>>> From:Ere Maijala <ere.maij...@helsinki.fi>
>>>>> Sent: Friday 2nd February 2018 8:49
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>>> 
>>>>> Markus,
>>>>> 
>>>>> I may be stating the obvious here, but I didn't notice garbage
>>>>> collection mentioned in any of the previous messages, so here goes. In
>>>>> our experience almost all of the Zookeeper timeouts etc. have been
>>>>> caused by too long garbage collection pauses. I've summed up my
>>>>> observations here:
>>>>> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html
>>>>> 
>>>>> 
>>>>> So, in my experience it's relatively easy to cause heavy memory usage
>>>>> with SolrCloud with seemingly innocent queries, and GC can become a
>>>>> problem really quickly even if everything seems to be running smoothly
>>>>> otherwise.
>>>>> 
>>>>> Regards,
>>>>> Ere
>>>>> 
>>>>> Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
>>>>>> Hello S.G.
>>>>>> 
>>>>>> We do not complain about speed improvements at all, it is clear 7.x is
>>>> faster than its p

Re: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread Walter Underwood
Zookeeper 3.4.6 is not good? That was the version recommended by Solr docs when 
I installed 6.2.0.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 2, 2018, at 9:30 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> Hello S.G.
> 
> We have relied in Trie* fields every since they became available, i don't 
> think reverting to the old fieldType's will do us any good, we have a very 
> recent problem.
> 
> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we only 
> recently increased it because or data keeps on growing. Heap rarely goes 
> higher than 50 %, except when this specific problem occurs. The nodes have no 
> problem processing a few hundred QPS continuously and can go on for days, 
> sometimes even a few weeks.
> 
> I will keep my eye open for other clues when the problem strikes again!
> 
> Thanks,
> Markus
> 
> -Original message-
>> From:S G <sg.online.em...@gmail.com>
>> Sent: Friday 2nd February 2018 18:20
>> To: solr-user@lucene.apache.org
>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>> 
>> Yeah, definitely check the zookeeper version.
>> 3.4.6 is not a good one I know and you can say the same for all the
>> versions below it too.
>> We have used 3.4.9 with no issues.
>> While Solr 7.x uses 3.4.10
>> 
>> Another dimension could be the use or (dis-use) of p-fields like pint,
>> plong etc.
>> If you are using them, try to revert back to tint, tlong etc
>> And if you are not using them, try to use them (Although doing this means a
>> change from your older config and less likely to help).
>> 
>> Lastly, did I read 2 GB for JVM heap?
>> That seems really too less to me for any version of Solr
>> We run with 10-16 gb of heap with G1GC collector and new-gen capped at 3-4gb
>> 
>> 
>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <markus.jel...@openindex.io>
>> wrote:
>> 
>>> Hello Ere,
>>> 
>>> It appears that my initial e-mail [1] got lost in the thread. We don't
>>> have GC issues, the cluster that dies occasionally runs, in general, smooth
>>> and quick with just 2 GB allocated.
>>> 
>>> Thanks,
>>> Markus
>>> 
>>> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-
>>> within-minutes-after-restart-td4372615.html
>>> 
>>> -Original message-
>>>> From:Ere Maijala <ere.maij...@helsinki.fi>
>>>> Sent: Friday 2nd February 2018 8:49
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
>>>> 
>>>> Markus,
>>>> 
>>>> I may be stating the obvious here, but I didn't notice garbage
>>>> collection mentioned in any of the previous messages, so here goes. In
>>>> our experience almost all of the Zookeeper timeouts etc. have been
>>>> caused by too long garbage collection pauses. I've summed up my
>>>> observations here:
>>>> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html
>>>> 
>>>> 
>>>> So, in my experience it's relatively easy to cause heavy memory usage
>>>> with SolrCloud with seemingly innocent queries, and GC can become a
>>>> problem really quickly even if everything seems to be running smoothly
>>>> otherwise.
>>>> 
>>>> Regards,
>>>> Ere
>>>> 
>>>> Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
>>>>> Hello S.G.
>>>>> 
>>>>> We do not complain about speed improvements at all, it is clear 7.x is
>>> faster than its predecessor. The problem is stability and not recovering
>>> from weird circumstances. In general, it is our high load cluster
>>> containing user interaction logs that suffers the most. Our main text
>>> search cluster - receiving much fewer queries - seems mostly unaffected,
>>> except last Sunday. After very short but high burst of queries it entered
>>> the same catatonic state the logs cluster usually dies from.
>>>>> 
>>>>> The query burst immediately caused ZK timeouts and high heap
>>> consumption (not sure which came first of the latter two). The query burst
>>> lasted for 30 minutes, the excessive heap consumption continued for more
>>> than 8 hours, before Solr finally realized it could relax. Most remarkable
>>> was that Solr recovered on its own, ZK timeouts stopped, heap went back to
>

RE: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread Markus Jelsma
Hello S.G.

We have relied on Trie* fields ever since they became available; I don't think 
reverting to the old fieldTypes will do us any good, since we have a very recent 
problem.

Regarding our heap, the cluster ran fine for years with just 1.5 GB; we only 
recently increased it because our data keeps on growing. Heap rarely goes higher 
than 50 %, except when this specific problem occurs. The nodes have no problem 
processing a few hundred QPS continuously and can go on for days, sometimes 
even a few weeks.

I will keep my eye open for other clues when the problem strikes again!

Thanks,
Markus

-Original message-
> From:S G <sg.online.em...@gmail.com>
> Sent: Friday 2nd February 2018 18:20
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Yeah, definitely check the zookeeper version.
> 3.4.6 is not a good one I know and you can say the same for all the
> versions below it too.
> We have used 3.4.9 with no issues.
> While Solr 7.x uses 3.4.10
> 
> Another dimension could be the use or (dis-use) of p-fields like pint,
> plong etc.
> If you are using them, try to revert back to tint, tlong etc
> And if you are not using them, try to use them (Although doing this means a
> change from your older config and less likely to help).
> 
> Lastly, did I read 2 GB for JVM heap?
> That seems really too less to me for any version of Solr
> We run with 10-16 gb of heap with G1GC collector and new-gen capped at 3-4gb
> 
> 
> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hello Ere,
> >
> > It appears that my initial e-mail [1] got lost in the thread. We don't
> > have GC issues, the cluster that dies occasionally runs, in general, smooth
> > and quick with just 2 GB allocated.
> >
> > Thanks,
> > Markus
> >
> > [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-
> > within-minutes-after-restart-td4372615.html
> >
> > -Original message-----
> > > From:Ere Maijala <ere.maij...@helsinki.fi>
> > > Sent: Friday 2nd February 2018 8:49
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > Markus,
> > >
> > > I may be stating the obvious here, but I didn't notice garbage
> > > collection mentioned in any of the previous messages, so here goes. In
> > > our experience almost all of the Zookeeper timeouts etc. have been
> > > caused by too long garbage collection pauses. I've summed up my
> > > observations here:
> > > <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html
> > >
> > >
> > > So, in my experience it's relatively easy to cause heavy memory usage
> > > with SolrCloud with seemingly innocent queries, and GC can become a
> > > problem really quickly even if everything seems to be running smoothly
> > > otherwise.
> > >
> > > Regards,
> > > Ere
> > >
> > > Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
> > > > Hello S.G.
> > > >
> > > > We do not complain about speed improvements at all, it is clear 7.x is
> > faster than its predecessor. The problem is stability and not recovering
> > from weird circumstances. In general, it is our high load cluster
> > containing user interaction logs that suffers the most. Our main text
> > search cluster - receiving much fewer queries - seems mostly unaffected,
> > except last Sunday. After very short but high burst of queries it entered
> > the same catatonic state the logs cluster usually dies from.
> > > >
> > > > The query burst immediately caused ZK timeouts and high heap
> > consumption (not sure which came first of the latter two). The query burst
> > lasted for 30 minutes, the excessive heap consumption continued for more
> > than 8 hours, before Solr finally realized it could relax. Most remarkable
> > was that Solr recovered on its own, ZK timeouts stopped, heap went back to
> > normal.
> > > >
> > > > There seems to be a causality between high load and this state.
> > > >
> > > > We really want to get this fixed for ourselves and everyone else that
> > may encounter this problem, but i don't know how, so i need much more
> > feedback and hints from those who have deep understanding of inner working
> > of Solrcloud and changes since 6.x.
> > > >
> > > > To be clear, we don't have the problem of 15 second ZK timeout, we use
> > 30. Is 30 too low still? Is it even remotely r

Re: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread S G
Yeah, definitely check the ZooKeeper version.
3.4.6 is not a good one, I know, and you can say the same for all the
versions below it too.
We have used 3.4.9 with no issues,
while Solr 7.x uses 3.4.10.

Another dimension could be the use (or dis-use) of p-fields like pint,
plong etc.
If you are using them, try to revert back to tint, tlong etc.,
and if you are not using them, try to use them (although doing this means a
change from your older config and is less likely to help).

Lastly, did I read 2 GB for JVM heap?
That seems far too little to me for any version of Solr.
We run with 10-16 GB of heap with the G1GC collector and new-gen capped at
3-4 GB.
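For anyone wanting to experiment with similar settings, heap size and collector
choices normally go in solr.in.sh; a rough sketch with purely illustrative values
(my assumption of how such a setup could look, not a recommendation from this
thread):

    # solr.in.sh (illustrative values only)
    SOLR_HEAP="12g"
    # G1 with a pause target and a new-gen cap roughly in the range mentioned above
    GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250 -XX:MaxNewSize=4g"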


On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello Ere,
>
> It appears that my initial e-mail [1] got lost in the thread. We don't
> have GC issues, the cluster that dies occasionally runs, in general, smooth
> and quick with just 2 GB allocated.
>
> Thanks,
> Markus
>
> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-
> within-minutes-after-restart-td4372615.html
>
> -Original message-
> > From:Ere Maijala <ere.maij...@helsinki.fi>
> > Sent: Friday 2nd February 2018 8:49
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.2.1 cluster dies within minutes after restart
> >
> > Markus,
> >
> > I may be stating the obvious here, but I didn't notice garbage
> > collection mentioned in any of the previous messages, so here goes. In
> > our experience almost all of the Zookeeper timeouts etc. have been
> > caused by too long garbage collection pauses. I've summed up my
> > observations here:
> > <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html
> >
> >
> > So, in my experience it's relatively easy to cause heavy memory usage
> > with SolrCloud with seemingly innocent queries, and GC can become a
> > problem really quickly even if everything seems to be running smoothly
> > otherwise.
> >
> > Regards,
> > Ere
> >
> > Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
> > > Hello S.G.
> > >
> > > We do not complain about speed improvements at all, it is clear 7.x is
> faster than its predecessor. The problem is stability and not recovering
> from weird circumstances. In general, it is our high load cluster
> containing user interaction logs that suffers the most. Our main text
> search cluster - receiving much fewer queries - seems mostly unaffected,
> except last Sunday. After very short but high burst of queries it entered
> the same catatonic state the logs cluster usually dies from.
> > >
> > > The query burst immediately caused ZK timeouts and high heap
> consumption (not sure which came first of the latter two). The query burst
> lasted for 30 minutes, the excessive heap consumption continued for more
> than 8 hours, before Solr finally realized it could relax. Most remarkable
> was that Solr recovered on its own, ZK timeouts stopped, heap went back to
> normal.
> > >
> > > There seems to be a causality between high load and this state.
> > >
> > > We really want to get this fixed for ourselves and everyone else that
> may encounter this problem, but i don't know how, so i need much more
> feedback and hints from those who have deep understanding of inner working
> of Solrcloud and changes since 6.x.
> > >
> > > To be clear, we don't have the problem of 15 second ZK timeout, we use
> 30. Is 30 too low still? Is it even remotely related to this problem? What
> does load have to do with it?
> > >
> > > We are not able to reproduce it in lab environments. It can take
> minutes after cluster startup for it to occur, but also days.
> > >
> > > I've been slightly annoyed by problems that can occur in a board time
> span, it is always bad luck for reproduction.
> > >
> > > Any help getting further is much appreciated.
> > >
> > > Many thanks,
> > > Markus
> > >
> > > -Original message-
> > >> From:S G <sg.online.em...@gmail.com>
> > >> Sent: Wednesday 31st January 2018 21:48
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >>
> > >> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> > >> And that came out all right.
> > >> We saw a performance increase of about 30% in read latencies between
> 6.6.0
> > >> and 7.1.0
> > >> And then we saw a performance degradation of about 10% between 7.1.0
> and
> > >> 7.2.1 in many metrics.

RE: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread Markus Jelsma
Hello Ere,

It appears that my initial e-mail [1] got lost in the thread. We don't have GC 
issues; the cluster that dies occasionally runs, in general, smoothly and quickly 
with just 2 GB allocated.

Thanks,
Markus

[1]: 
http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html

-Original message-
> From:Ere Maijala <ere.maij...@helsinki.fi>
> Sent: Friday 2nd February 2018 8:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Markus,
> 
> I may be stating the obvious here, but I didn't notice garbage 
> collection mentioned in any of the previous messages, so here goes. In 
> our experience almost all of the Zookeeper timeouts etc. have been 
> caused by too long garbage collection pauses. I've summed up my 
> observations here: 
> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
> 
> So, in my experience it's relatively easy to cause heavy memory usage 
> with SolrCloud with seemingly innocent queries, and GC can become a 
> problem really quickly even if everything seems to be running smoothly 
> otherwise.
> 
> Regards,
> Ere
> 
> Markus Jelsma kirjoitti 31.1.2018 klo 23.56:
> > Hello S.G.
> > 
> > We do not complain about speed improvements at all, it is clear 7.x is 
> > faster than its predecessor. The problem is stability and not recovering 
> > from weird circumstances. In general, it is our high load cluster 
> > containing user interaction logs that suffers the most. Our main text 
> > search cluster - receiving much fewer queries - seems mostly unaffected, 
> > except last Sunday. After very short but high burst of queries it entered 
> > the same catatonic state the logs cluster usually dies from.
> > 
> > The query burst immediately caused ZK timeouts and high heap consumption 
> > (not sure which came first of the latter two). The query burst lasted for 
> > 30 minutes, the excessive heap consumption continued for more than 8 hours, 
> > before Solr finally realized it could relax. Most remarkable was that Solr 
> > recovered on its own, ZK timeouts stopped, heap went back to normal.
> > 
> > There seems to be a causality between high load and this state.
> > 
> > We really want to get this fixed for ourselves and everyone else that may 
> > encounter this problem, but i don't know how, so i need much more feedback 
> > and hints from those who have deep understanding of inner working of 
> > Solrcloud and changes since 6.x.
> > 
> > To be clear, we don't have the problem of 15 second ZK timeout, we use 30. 
> > Is 30 too low still? Is it even remotely related to this problem? What does 
> > load have to do with it?
> > 
> > We are not able to reproduce it in lab environments. It can take minutes 
> > after cluster startup for it to occur, but also days.
> > 
> > I've been slightly annoyed by problems that can occur in a board time span, 
> > it is always bad luck for reproduction.
> > 
> > Any help getting further is much appreciated.
> > 
> > Many thanks,
> > Markus
> >   
> > -Original message-
> >> From:S G <sg.online.em...@gmail.com>
> >> Sent: Wednesday 31st January 2018 21:48
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>
> >> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> >> And that came out all right.
> >> We saw a performance increase of about 30% in read latencies between 6.6.0
> >> and 7.1.0
> >> And then we saw a performance degradation of about 10% between 7.1.0 and
> >> 7.2.1 in many metrics.
> >> But overall, it still seems better than 6.6.0.
> >>
> >> I will check for the errors too in the logs but the nodes were responsive
> >> for all the 23+ hours we did the load test.
> >>
> >> Disclaimer: We do not test facets and pivots or block-joins. And will add
> >> those features to our load-testing tool sometime this year.
> >>
> >> Thanks
> >> SG
> >>
> >>
> >> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
> >> wrote:
> >>
> >>> Ah thanks, i just submitted a patch fixing it.
> >>>
> >>> Anyway, in the end it appears this is not the problem we are seeing as our
> >>> timeouts were already at 30 seconds.
> >>>
> >>> All i know is that at some point nodes start to lose ZK connections due to
> >>> timeouts (

RE: 7.2.1 cluster dies within minutes after restart

2018-02-02 Thread Markus Jelsma
Hello S.G, see inline.

Thanks,
Markus
 
-Original message-
> From:S G <sg.online.em...@gmail.com>
> Sent: Thursday 1st February 2018 17:42
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> ok, good to know that 7.x shows good performance for you too.
> 
> 1) Regarding the zookeeper problem, do you know for sure that it does not
> occur in 6.x ?
>  I would suggest to write a small load-test that can send a similar
> kind of load to 6.x and 7.x clusters and see which one breaks.
>  I know that these kind of problems can take days to occur but without
> a reproducible pattern, it may be hard to fix.

It is not reproducible in controlled environments. I have not seen Solr's 
cluster stability deteriorate this badly since 1.2.

> 
> 2) Another thing is the zookeeper version.
> 7.x uses 3.4.10 version of zookeeper (See
> https://github.com/apache/lucene-solr/blob/branch_7_2/lucene/ivy-versions.properties#L192
> )
> If you are using 3.4.10, try using 3.4.9 or vice versa.
> Do not use zookeeper versions lower than 3.4.9 - they have some nasty
> bugs.

I am not sure which version we run. I'll check with my colleague, and 
upgrade if necessary. A note: all other Solr collections and all Hadoop daemons 
have no trouble with Zookeeper.
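As a side note, the running version can be read straight off each ZooKeeper
server with the stat four-letter command (host and port below are placeholders):

    # the first line of the stat output reports e.g. "Zookeeper version: 3.4.6-..."
    echo stat | nc zk-host 2181 | head -1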

> 
> 3) Do take a look at zookeeper cluster too.
> ZK has 4-letter commands like ruok, srvr etc that reveal a lot of its
> internal activity.

Zookeeper logs have also shown messages of clients disconnecting and stuff, but 
it is hard to find the problem there. The problem is that one very specific 
Solr cluster dies, and others don't.

> 
> 4) Hopefully, you are not doing anything cross-DC as that could cause
> network delays and cause such problems.

No.

> 
> 5) As far as I can remember, we have seen some zookeeper issues but they
> were generally related to 3.4.6 version or
> VMs getting replaced in cloud environment and the IP's not getting
> refreshed in the ZK's configs.

I've seen 3.4.6 somewhere in our Salt files, so I think we may still be running 
that version, but I'll check.

> 
> That's all I could think of from a user's perspective  --\_(0.0)_/--
> 
> Thanks
> SG
> 
> 
> 
> On Wed, Jan 31, 2018 at 1:56 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hello S.G.
> >
> > We do not complain about speed improvements at all, it is clear 7.x is
> > faster than its predecessor. The problem is stability and not recovering
> > from weird circumstances. In general, it is our high load cluster
> > containing user interaction logs that suffers the most. Our main text
> > search cluster - receiving much fewer queries - seems mostly unaffected,
> > except last Sunday. After very short but high burst of queries it entered
> > the same catatonic state the logs cluster usually dies from.
> >
> > The query burst immediately caused ZK timeouts and high heap consumption
> > (not sure which came first of the latter two). The query burst lasted for
> > 30 minutes, the excessive heap consumption continued for more than 8 hours,
> > before Solr finally realized it could relax. Most remarkable was that Solr
> > recovered on its own, ZK timeouts stopped, heap went back to normal.
> >
> > There seems to be a causality between high load and this state.
> >
> > We really want to get this fixed for ourselves and everyone else that may
> > encounter this problem, but i don't know how, so i need much more feedback
> > and hints from those who have deep understanding of inner working of
> > Solrcloud and changes since 6.x.
> >
> > To be clear, we don't have the problem of 15 second ZK timeout, we use 30.
> > Is 30 too low still? Is it even remotely related to this problem? What does
> > load have to do with it?
> >
> > We are not able to reproduce it in lab environments. It can take minutes
> > after cluster startup for it to occur, but also days.
> >
> > I've been slightly annoyed by problems that can occur in a board time
> > span, it is always bad luck for reproduction.
> >
> > Any help getting further is much appreciated.
> >
> > Many thanks,
> > Markus
> >
> > -Original message-
> > > From:S G <sg.online.em...@gmail.com>
> > > Sent: Wednesday 31st January 2018 21:48
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> > > And that came out all right.
> > > We saw a performance increase of about 30%

Re: 7.2.1 cluster dies within minutes after restart

2018-02-01 Thread Ere Maijala

Markus,

I may be stating the obvious here, but I didn't notice garbage 
collection mentioned in any of the previous messages, so here goes. In 
our experience almost all of the Zookeeper timeouts etc. have been 
caused by too long garbage collection pauses. I've summed up my 
observations here: 
<https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>


So, in my experience it's relatively easy to cause heavy memory usage 
with SolrCloud with seemingly innocent queries, and GC can become a 
problem really quickly even if everything seems to be running smoothly 
otherwise.


Regards,
Ere
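A quick way to check this in practice, sketched here for illustration, is to put
Solr's GC log next to the ZooKeeper timeouts; assuming the default GC_LOG_OPTS
(which normally include -XX:+PrintGCApplicationStoppedTime) and a default log
location, something like this lists the stop-the-world pauses:

    # long entries here that line up with the ZK timeout timestamps in solr.log
    # point to GC pauses as the cause (log path is an assumption for a default install)
    grep -h "Total time for which application threads were stopped" server/logs/solr_gc.log*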

Markus Jelsma kirjoitti 31.1.2018 klo 23.56:

Hello S.G.

We do not complain about speed improvements at all, it is clear 7.x is faster 
than its predecessor. The problem is stability and not recovering from weird 
circumstances. In general, it is our high load cluster containing user 
interaction logs that suffers the most. Our main text search cluster - 
receiving much fewer queries - seems mostly unaffected, except last Sunday. 
After very short but high burst of queries it entered the same catatonic state 
the logs cluster usually dies from.

The query burst immediately caused ZK timeouts and high heap consumption (not 
sure which came first of the latter two). The query burst lasted for 30 
minutes, the excessive heap consumption continued for more than 8 hours, before 
Solr finally realized it could relax. Most remarkable was that Solr recovered 
on its own, ZK timeouts stopped, heap went back to normal.

There seems to be a causality between high load and this state.

We really want to get this fixed for ourselves and everyone else that may 
encounter this problem, but i don't know how, so i need much more feedback and 
hints from those who have deep understanding of inner working of Solrcloud and 
changes since 6.x.

To be clear, we don't have the problem of 15 second ZK timeout, we use 30. Is 
30 too low still? Is it even remotely related to this problem? What does load 
have to do with it?

We are not able to reproduce it in lab environments. It can take minutes after 
cluster startup for it to occur, but also days.

I've been slightly annoyed by problems that can occur in a board time span, it 
is always bad luck for reproduction.

Any help getting further is much appreciated.

Many thanks,
Markus
  
-Original message-

From:S G <sg.online.em...@gmail.com>
Sent: Wednesday 31st January 2018 21:48
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart

We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
And that came out all right.
We saw a performance increase of about 30% in read latencies between 6.6.0
and 7.1.0
And then we saw a performance degradation of about 10% between 7.1.0 and
7.2.1 in many metrics.
But overall, it still seems better than 6.6.0.

I will check for the errors too in the logs but the nodes were responsive
for all the 23+ hours we did the load test.

Disclaimer: We do not test facets and pivots or block-joins. And will add
those features to our load-testing tool sometime this year.

Thanks
SG


On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:


Ah thanks, i just submitted a patch fixing it.

Anyway, in the end it appears this is not the problem we are seeing as our
timeouts were already at 30 seconds.

All i know is that at some point nodes start to lose ZK connections due to
timeouts (logs say so, but all within 30 seconds), the logs are flooded
with those messages:
o.a.z.ClientCnxn Client session timed out, have not heard from server in
10359ms for sessionid 0x160f9e723c12122
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
0x60f9e7234f05bb has expired

Then there is a doubling in heap usage and nodes become unresponsive, die
etc.

We also see those messages in other collections, but not so frequently and
they don't cause failure in those less loaded clusters.

Ideas?

Thanks,
Markus

-Original message-

From:Michael Braun <n3c...@gmail.com>
Sent: Monday 29th January 2018 21:09
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart

Believe this is reported in https://issues.apache.org/

jira/browse/SOLR-10471



On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <

markus.jel...@openindex.io>

wrote:


Hello SG,

The default in solr.in.sh is commented so it defaults to the value

set in

bin/solr, which is fifteen seconds. Just uncomment the setting in
solr.in.sh and your timeout will be thirty seconds.

For Solr itself to really default to thirty seconds, Solr's bin/solr

needs

to be patched to use the correct value.

Regards,
Markus

-Original message-

From:S G <sg.online.em...@gmail.com>
Sent: Monday 29th January 2018 20:15
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart

Hi Markus,

We are in the process of upgrading our clu

Re: 7.2.1 cluster dies within minutes after restart

2018-02-01 Thread S G
OK, good to know that 7.x shows good performance for you too.

1) Regarding the zookeeper problem, do you know for sure that it does not
occur in 6.x ?
 I would suggest writing a small load test that can send a similar
kind of load to 6.x and 7.x clusters and see which one breaks.
 I know that these kinds of problems can take days to occur, but without
a reproducible pattern, it may be hard to fix.

2) Another thing is the zookeeper version.
7.x uses 3.4.10 version of zookeeper (See
https://github.com/apache/lucene-solr/blob/branch_7_2/lucene/ivy-versions.properties#L192
)
If you are using 3.4.10, try using 3.4.9 or vice versa.
Do not use zookeeper versions lower than 3.4.9 - they have some nasty
bugs.

3) Do take a look at zookeeper cluster too.
ZK has 4-letter commands like ruok, srvr etc that reveal a lot of its
internal activity.

4) Hopefully, you are not doing anything cross-DC as that could cause
network delays and cause such problems.

5) As far as I can remember, we have seen some zookeeper issues, but they
were generally related to the 3.4.6 version or to
VMs getting replaced in a cloud environment and the IPs not getting
refreshed in the ZK configs.

That's all I could think of from a user's perspective  --\_(0.0)_/--

Thanks
SG
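As a concrete example of the four-letter commands mentioned in point 3 above,
they can be sent with nc; a minimal health check, with placeholder hostnames,
might look like:

    for h in zk1 zk2 zk3; do
      echo "== $h =="
      echo ruok | nc "$h" 2181 && echo   # a healthy server answers "imok"
      # mntr reports latency, outstanding requests and open connections
      echo mntr | nc "$h" 2181 | grep -E 'zk_avg_latency|zk_outstanding_requests|zk_num_alive_connections'
    done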



On Wed, Jan 31, 2018 at 1:56 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello S.G.
>
> We do not complain about speed improvements at all, it is clear 7.x is
> faster than its predecessor. The problem is stability and not recovering
> from weird circumstances. In general, it is our high load cluster
> containing user interaction logs that suffers the most. Our main text
> search cluster - receiving much fewer queries - seems mostly unaffected,
> except last Sunday. After very short but high burst of queries it entered
> the same catatonic state the logs cluster usually dies from.
>
> The query burst immediately caused ZK timeouts and high heap consumption
> (not sure which came first of the latter two). The query burst lasted for
> 30 minutes, the excessive heap consumption continued for more than 8 hours,
> before Solr finally realized it could relax. Most remarkable was that Solr
> recovered on its own, ZK timeouts stopped, heap went back to normal.
>
> There seems to be a causality between high load and this state.
>
> We really want to get this fixed for ourselves and everyone else that may
> encounter this problem, but i don't know how, so i need much more feedback
> and hints from those who have deep understanding of inner working of
> Solrcloud and changes since 6.x.
>
> To be clear, we don't have the problem of 15 second ZK timeout, we use 30.
> Is 30 too low still? Is it even remotely related to this problem? What does
> load have to do with it?
>
> We are not able to reproduce it in lab environments. It can take minutes
> after cluster startup for it to occur, but also days.
>
> I've been slightly annoyed by problems that can occur in a board time
> span, it is always bad luck for reproduction.
>
> Any help getting further is much appreciated.
>
> Many thanks,
> Markus
>
> -Original message-
> > From:S G <sg.online.em...@gmail.com>
> > Sent: Wednesday 31st January 2018 21:48
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.2.1 cluster dies within minutes after restart
> >
> > We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> > And that came out all right.
> > We saw a performance increase of about 30% in read latencies between
> 6.6.0
> > and 7.1.0
> > And then we saw a performance degradation of about 10% between 7.1.0 and
> > 7.2.1 in many metrics.
> > But overall, it still seems better than 6.6.0.
> >
> > I will check for the errors too in the logs but the nodes were responsive
> > for all the 23+ hours we did the load test.
> >
> > Disclaimer: We do not test facets and pivots or block-joins. And will add
> > those features to our load-testing tool sometime this year.
> >
> > Thanks
> > SG
> >
> >
> > On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > Ah thanks, i just submitted a patch fixing it.
> > >
> > > Anyway, in the end it appears this is not the problem we are seeing as
> our
> > > timeouts were already at 30 seconds.
> > >
> > > All i know is that at some point nodes start to lose ZK connections
> due to
> > > timeouts (logs say so, but all within 30 seconds), the logs are flooded
> > > with those messages:
> > > o.a.z.ClientCnxn Client session timed out, have not heard from server
> in
> > > 10359ms for sessionid 0x160f9e723c12122
> > >

RE: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread Markus Jelsma
Hello S.G.

We do not complain about speed improvements at all; it is clear 7.x is faster 
than its predecessor. The problem is stability and not recovering from weird 
circumstances. In general, it is our high-load cluster containing user 
interaction logs that suffers the most. Our main text search cluster - 
receiving far fewer queries - seems mostly unaffected, except last Sunday. 
After a very short but high burst of queries it entered the same catatonic state 
the logs cluster usually dies from. 

The query burst immediately caused ZK timeouts and high heap consumption (not 
sure which of the latter two came first). The query burst lasted for 30 
minutes; the excessive heap consumption continued for more than 8 hours before 
Solr finally realized it could relax. Most remarkable was that Solr recovered 
on its own: ZK timeouts stopped and heap went back to normal.

There seems to be a causality between high load and this state.

We really want to get this fixed for ourselves and everyone else that may 
encounter this problem, but I don't know how, so I need much more feedback and 
hints from those who have a deep understanding of the inner workings of 
SolrCloud and the changes since 6.x.

To be clear, we don't have the problem of the 15-second ZK timeout; we use 30. 
Is 30 still too low? Is it even remotely related to this problem? What does 
load have to do with it?

We are not able to reproduce it in lab environments. It can take minutes after 
cluster startup for it to occur, but also days. 

I've been slightly annoyed by problems that can occur across a broad time span; 
it is always bad luck for reproduction.

Any help getting further is much appreciated.

Many thanks,
Markus
 
-Original message-
> From:S G <sg.online.em...@gmail.com>
> Sent: Wednesday 31st January 2018 21:48
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> And that came out all right.
> We saw a performance increase of about 30% in read latencies between 6.6.0
> and 7.1.0
> And then we saw a performance degradation of about 10% between 7.1.0 and
> 7.2.1 in many metrics.
> But overall, it still seems better than 6.6.0.
> 
> I will check for the errors too in the logs but the nodes were responsive
> for all the 23+ hours we did the load test.
> 
> Disclaimer: We do not test facets and pivots or block-joins. And will add
> those features to our load-testing tool sometime this year.
> 
> Thanks
> SG
> 
> 
> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Ah thanks, i just submitted a patch fixing it.
> >
> > Anyway, in the end it appears this is not the problem we are seeing as our
> > timeouts were already at 30 seconds.
> >
> > All i know is that at some point nodes start to lose ZK connections due to
> > timeouts (logs say so, but all within 30 seconds), the logs are flooded
> > with those messages:
> > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > 10359ms for sessionid 0x160f9e723c12122
> > o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> > 0x60f9e7234f05bb has expired
> >
> > Then there is a doubling in heap usage and nodes become unresponsive, die
> > etc.
> >
> > We also see those messages in other collections, but not so frequently and
> > they don't cause failure in those less loaded clusters.
> >
> > Ideas?
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> > > From:Michael Braun <n3c...@gmail.com>
> > > Sent: Monday 29th January 2018 21:09
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > Believe this is reported in https://issues.apache.org/
> > jira/browse/SOLR-10471
> > >
> > >
> > > On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > Hello SG,
> > > >
> > > > The default in solr.in.sh is commented so it defaults to the value
> > set in
> > > > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > > > solr.in.sh and your timeout will be thirty seconds.
> > > >
> > > > For Solr itself to really default to thirty seconds, Solr's bin/solr
> > needs
> > > > to be patched to use the correct value.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > -Original message-
> > > > > From:S G <sg.online.em...@gmail.com>

Re: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread S G
We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
And that came out all right.
We saw a performance increase of about 30% in read latencies between 6.6.0
and 7.1.0,
and then a performance degradation of about 10% between 7.1.0 and
7.2.1 in many metrics.
But overall, it still seems better than 6.6.0.

I will check for the errors too in the logs but the nodes were responsive
for all the 23+ hours we did the load test.

Disclaimer: We do not test facets and pivots or block-joins. And will add
those features to our load-testing tool sometime this year.

Thanks
SG


On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Ah thanks, i just submitted a patch fixing it.
>
> Anyway, in the end it appears this is not the problem we are seeing as our
> timeouts were already at 30 seconds.
>
> All i know is that at some point nodes start to lose ZK connections due to
> timeouts (logs say so, but all within 30 seconds), the logs are flooded
> with those messages:
> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> 10359ms for sessionid 0x160f9e723c12122
> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> 0x60f9e7234f05bb has expired
>
> Then there is a doubling in heap usage and nodes become unresponsive, die
> etc.
>
> We also see those messages in other collections, but not so frequently and
> they don't cause failure in those less loaded clusters.
>
> Ideas?
>
> Thanks,
> Markus
>
> -Original message-
> > From:Michael Braun <n3c...@gmail.com>
> > Sent: Monday 29th January 2018 21:09
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.2.1 cluster dies within minutes after restart
> >
> > Believe this is reported in https://issues.apache.org/
> jira/browse/SOLR-10471
> >
> >
> > On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello SG,
> > >
> > > The default in solr.in.sh is commented so it defaults to the value
> set in
> > > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > > solr.in.sh and your timeout will be thirty seconds.
> > >
> > > For Solr itself to really default to thirty seconds, Solr's bin/solr
> needs
> > > to be patched to use the correct value.
> > >
> > > Regards,
> > > Markus
> > >
> > > -Original message-
> > > > From:S G <sg.online.em...@gmail.com>
> > > > Sent: Monday 29th January 2018 20:15
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > >
> > > > Hi Markus,
> > > >
> > > > We are in the process of upgrading our clusters to 7.2.1 and I am not
> > > sure
> > > > I quite follow the conversation here.
> > > > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> > > value
> > > > in the config (and it's just a default value being wrong/overridden
> > > > somewhere)?
> > > > Or is it more severe in the sense that any config set for
> > > ZK_CLIENT_TIMEOUT
> > > > by the user is just ignored completely by Solr in 7.2.1 ?
> > > >
> > > > Thanks
> > > > SG
> > > >
> > > >
> > > > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> > > markus.jel...@openindex.io>
> > > > wrote:
> > > >
> > > > > Ok, i applied the patch and it is clear the timeout is 15000.
> Solr.xml
> > > > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default
> unset
> > > in
> > > > > solr.in.sh,but set in bin/solr to 15000. So it seems Solr's
> default is
> > > > > still 15000, not 30000.
> > > > >
> > > > > But, back to my topic. I see we explicitly set it in solr.in.sh to
> > > 30000.
> > > > > To be sure, i applied your patch to a production machine, all our
> > > > > collections run with 30000. So how would that explain this log
> line?
> > > > >
> > > > > o.a.z.ClientCnxn Client session timed out, have not heard from
> server
> > > in
> > > > > 22130ms
> > > > >
> > > > > We also see these with smaller values, seven seconds. And, is this
> > > > > actually an indicator of the problems we have?
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Many thank

RE: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread Markus Jelsma
Ah thanks, I just submitted a patch fixing it.

Anyway, in the end it appears this is not the problem we are seeing as our 
timeouts were already at 30 seconds.

All I know is that at some point nodes start to lose ZK connections due to 
timeouts (the logs say so, but all within 30 seconds); the logs are flooded with 
these messages:
o.a.z.ClientCnxn Client session timed out, have not heard from server in 
10359ms for sessionid 0x160f9e723c12122
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session 
0x60f9e7234f05bb has expired

Then there is a doubling in heap usage and nodes become unresponsive, die etc. 

We also see those messages in other collections, but not so frequently and they 
don't cause failure in those less loaded clusters.

Ideas?

Thanks,
Markus
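One thing that might help narrow this down is pulling the timestamps of those
ClientCnxn lines out of solr.log and lining them up against the heap graphs and
the GC log; a rough sketch, assuming a default log location and the standard
timestamp prefix at the start of each solr.log line:

    # timestamps at which Solr saw ZK client session timeouts
    # (log path is an assumption for a default install)
    grep -h "o.a.z.ClientCnxn Client session timed out" server/logs/solr.log* | cut -c1-23

Comparing those timestamps against the GC log and heap graphs should show
whether the timeouts coincide with long pauses or with the heap doubling.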

-Original message-
> From:Michael Braun <n3c...@gmail.com>
> Sent: Monday 29th January 2018 21:09
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471
> 
> 
> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hello SG,
> >
> > The default in solr.in.sh is commented so it defaults to the value set in
> > bin/solr, which is fifteen seconds. Just uncomment the setting in
> > solr.in.sh and your timeout will be thirty seconds.
> >
> > For Solr itself to really default to thirty seconds, Solr's bin/solr needs
> > to be patched to use the correct value.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> > > From:S G <sg.online.em...@gmail.com>
> > > Sent: Monday 29th January 2018 20:15
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > >
> > > Hi Markus,
> > >
> > > We are in the process of upgrading our clusters to 7.2.1 and I am not
> > sure
> > > I quite follow the conversation here.
> > > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> > value
> > > in the config (and it's just a default value being wrong/overridden
> > > somewhere)?
> > > Or is it more severe in the sense that any config set for
> > ZK_CLIENT_TIMEOUT
> > > by the user is just ignored completely by Solr in 7.2.1 ?
> > >
> > > Thanks
> > > SG
> > >
> > >
> > > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > Ok, i applied the patch and it is clear the timeout is 15000. Solr.xml
> > > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset
> > in
> > > > solr.in.sh,but set in bin/solr to 15000. So it seems Solr's default is
> > > > still 15000, not 30000.
> > > >
> > > > But, back to my topic. I see we explicitly set it in solr.in.sh to
> > 30000.
> > > > To be sure, i applied your patch to a production machine, all our
> > > > collections run with 30000. So how would that explain this log line?
> > > >
> > > > o.a.z.ClientCnxn Client session timed out, have not heard from server
> > in
> > > > 22130ms
> > > >
> > > > We also see these with smaller values, seven seconds. And, is this
> > > > actually an indicator of the problems we have?
> > > >
> > > > Any ideas?
> > > >
> > > > Many thanks,
> > > > Markus
> > > >
> > > >
> > > > -Original message-
> > > > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > > > Sent: Saturday 27th January 2018 10:03
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: RE: 7.2.1 cluster dies within minutes after restart
> > > > >
> > > > > Hello,
> > > > >
> > > > > I grepped for it yesterday and found nothing but 30000 in the
> > settings,
> > > > but judging from the weird time out value, you may be right. Let me
> > apply
> > > > your patch early next week and check for spurious warnings.
> > > > >
> > > > > Another note worthy observation for those working on cloud stability
> > and
> > > > recovery, whenever this happens, some nodes are also absolutely sure
> > to run
> > > > OOM. The leaders usually live longest, the replica's don't, their heap
> > > > usage peaks every time, consistently.

Re: 7.2.1 cluster dies within minutes after restart

2018-01-29 Thread Michael Braun
Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471


On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello SG,
>
> The default in solr.in.sh is commented so it defaults to the value set in
> bin/solr, which is fifteen seconds. Just uncomment the setting in
> solr.in.sh and your timeout will be thirty seconds.
>
> For Solr itself to really default to thirty seconds, Solr's bin/solr needs
> to be patched to use the correct value.
>
> Regards,
> Markus
>
> -Original message-
> > From:S G <sg.online.em...@gmail.com>
> > Sent: Monday 29th January 2018 20:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.2.1 cluster dies within minutes after restart
> >
> > Hi Markus,
> >
> > We are in the process of upgrading our clusters to 7.2.1 and I am not
> sure
> > I quite follow the conversation here.
> > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> value
> > in the config (and it's just a default value being wrong/overridden
> > somewhere)?
> > Or is it more severe in the sense that any config set for
> ZK_CLIENT_TIMEOUT
> > by the user is just ignored completely by Solr in 7.2.1 ?
> >
> > Thanks
> > SG
> >
> >
> > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > Ok, i applied the patch and it is clear the timeout is 15000. Solr.xml
> > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset
> in
> > > solr.in.sh,but set in bin/solr to 15000. So it seems Solr's default is
> > > still 15000, not 30000.
> > >
> > > But, back to my topic. I see we explicitly set it in solr.in.sh to
> 30000.
> > > To be sure, i applied your patch to a production machine, all our
> > > collections run with 30000. So how would that explain this log line?
> > >
> > > o.a.z.ClientCnxn Client session timed out, have not heard from server
> in
> > > 22130ms
> > >
> > > We also see these with smaller values, seven seconds. And, is this
> > > actually an indicator of the problems we have?
> > >
> > > Any ideas?
> > >
> > > Many thanks,
> > > Markus
> > >
> > >
> > > -Original message-
> > > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > > Sent: Saturday 27th January 2018 10:03
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: 7.2.1 cluster dies within minutes after restart
> > > >
> > > > Hello,
> > > >
> > > > I grepped for it yesterday and found nothing but 30000 in the
> settings,
> > > but judging from the weird time out value, you may be right. Let me
> apply
> > > your patch early next week and check for spurious warnings.
> > > >
> > > > Another note worthy observation for those working on cloud stability
> and
> > > recovery, whenever this happens, some nodes are also absolutely sure
> to run
> > > OOM. The leaders usually live longest, the replica's don't, their heap
> > > usage peaks every time, consistently.
> > > >
> > > > Thanks,
> > > > Markus
> > > >
> > > > -Original message-
> > > > > From:Shawn Heisey <apa...@elyograg.org>
> > > > > Sent: Saturday 27th January 2018 0:49
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > > >
> > > > > On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> > > > > > o.a.z.ClientCnxn Client session timed out, have not heard from
> > > server in 22130ms (although zkClientTimeOut is 30000).
> > > > >
> > > > > Are you absolutely certain that there is a setting for
> zkClientTimeout
> > > > > that is actually getting applied?  The default value in Solr's
> example
> > > > > configs is 30 seconds, but the internal default in the code (when
> no
> > > > > configuration is found) is still 15.  I have confirmed this in the
> > > code.
> > > > >
> > > > > Looks like SolrCloud doesn't log the values it's using for things
> like
> > > > > zkClientTimeout.  I think it should.
> > > > >
> > > > > https://issues.apache.org/jira/browse/SOLR-11915
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > > > >
> > > >
> > >
> >
>


RE: 7.2.1 cluster dies within minutes after restart

2018-01-29 Thread Markus Jelsma
Hello SG,

The setting in solr.in.sh is commented out by default, so Solr falls back to the 
value set in bin/solr, which is fifteen seconds. Just uncomment the setting in 
solr.in.sh and your timeout will be thirty seconds.

For Solr itself to really default to thirty seconds, Solr's bin/solr needs to 
be patched to use the correct value.

Regards,
Markus
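
A minimal sketch of the change described above, assuming the stock 7.x
solr.in.sh layout (the exact path and comment text vary per install):

    # solr.in.sh -- shipped with the setting commented out, so bin/solr's
    # 15000 ms fallback wins:
    #ZK_CLIENT_TIMEOUT="30000"

    # Uncomment (or add) the line to get a thirty-second session timeout:
    ZK_CLIENT_TIMEOUT="30000"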
 


Re: 7.2.1 cluster dies within minutes after restart

2018-01-29 Thread S G
Hi Markus,

We are in the process of upgrading our clusters to 7.2.1 and I am not sure
I quite follow the conversation here.
Is there a simple workaround to set ZK_CLIENT_TIMEOUT to a higher value
in the config (i.e. it's just a default value being wrong or overridden
somewhere)?
Or is it more severe, in the sense that any ZK_CLIENT_TIMEOUT the user sets
is just ignored completely by Solr in 7.2.1?

Thanks
SG




RE: 7.2.1 cluster dies within minutes after restart

2018-01-29 Thread Markus Jelsma
Ok, I applied the patch and it is clear the timeout is 15000. Solr.xml says 
30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in 
solr.in.sh, but set in bin/solr to 15000. So it seems Solr's effective default 
is still 15000, not 30000.

But, back to my topic. I see we explicitly set it in solr.in.sh to 30000. To be 
sure, I applied your patch to a production machine; all our collections run 
with 30000. So how would that explain this log line?

o.a.z.ClientCnxn Client session timed out, have not heard from server in 22130ms

We also see these with smaller values, around seven seconds. And is this actually 
an indicator of the problems we have?

Any ideas?

Many thanks,
Markus
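
A rough sketch of the fallback chain being described here, following
bin/solr's usual pattern (the real script may differ line for line):

    # bin/solr (sketch, not verbatim): if solr.in.sh leaves ZK_CLIENT_TIMEOUT
    # unset, fall back to 15000 ms and pass it as a system property, which in
    # turn overrides the 30000 ms default written in solr.xml.
    if [ -z "$ZK_CLIENT_TIMEOUT" ]; then
      ZK_CLIENT_TIMEOUT="15000"
    fi
    SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=$ZK_CLIENT_TIMEOUT"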
 
 


RE: 7.2.1 cluster dies within minutes after restart

2018-01-27 Thread Markus Jelsma
Hello,

I grepped for it yesterday and found nothing but 30000 in the settings, but 
judging from the weird timeout value, you may be right. Let me apply your 
patch early next week and check for spurious warnings.

Another noteworthy observation for those working on cloud stability and 
recovery: whenever this happens, some nodes are also absolutely sure to run 
OOM. The leaders usually live longest, the replicas don't; their heap usage 
peaks every time, consistently.

Thanks,
Markus
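
One way to capture evidence the next time a replica runs OOM is to let the
JVM write a heap dump; a sketch using the standard HotSpot flags via
SOLR_OPTS in solr.in.sh (the dump path is an assumption, pick a disk with
room for a multi-GB file):

    # solr.in.sh -- write a heap dump when a node hits OutOfMemoryError
    SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError \
      -XX:HeapDumpPath=/var/solr/logs"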
 


Re: 7.2.1 cluster dies within minutes after restart

2018-01-26 Thread Shawn Heisey
On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> o.a.z.ClientCnxn Client session timed out, have not heard from server in 
> 22130ms (although zkClientTimeOut is 30000).

Are you absolutely certain that there is a setting for zkClientTimeout
that is actually getting applied?  The default value in Solr's example
configs is 30 seconds, but the internal default in the code (when no
configuration is found) is still 15.  I have confirmed this in the code.

Looks like SolrCloud doesn't log the values it's using for things like
zkClientTimeout.  I think it should.

https://issues.apache.org/jira/browse/SOLR-11915

Thanks,
Shawn
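
Until something like SOLR-11915 lands, a quick way to see which value is
actually in play is to check every place it can come from; a sketch, with
paths assumed for a typical install:

    # where is the timeout defined?
    grep -n "ZK_CLIENT_TIMEOUT" /etc/default/solr.in.sh /opt/solr/bin/solr
    grep -n "zkClientTimeout" /var/solr/data/solr.xml

    # what did the running JVM actually get on its command line?
    ps -ef | grep -o -- "-DzkClientTimeout=[0-9]*"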