Hi Philippa,

Try taking a heap dump (when heap usage is high) and then using a profiler to look at which objects are taking up most of the memory. I have seen that if you are faceting/sorting on a large number of documents, the fieldCache grows very big and dominates most of the heap. Enabling docValues on the fields you sort/facet on helps.
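To get the dump you can use jmap against the Solr process while the heap is high, something like:

jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr pid>

and then open the .hprof file in a profiler (Eclipse MAT, VisualVM) and see what dominates the retained memory. For docValues, the change goes in schema.xml; a rough sketch - the field name here is just an example, and you will need to reindex after changing it:

<!-- enable docValues on each field you sort or facet on -->
<field name="facet_field" type="string" indexed="true" stored="false" docValues="true"/>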
On 8 December 2015 at 07:17, philippa griggs <philippa.gri...@hotmail.co.uk> wrote:
> Hello Emir,
>
> The query load is around 35 requests per minute on each shard; we don't
> document route, so we query the entire index.
>
> We do have some heavy queries like faceting, and it's possible that a
> heavy query is causing the nodes to go down - we are looking into this.
> I'm new to Solr, so this could be a slightly stupid question, but would a
> heavy query cause most of the nodes to go down? This didn't happen with
> the previous Solr version we were using (Solr 4.10.0); we did have
> nodes/shards which went down, but there wasn't a wipe-out effect where
> most of the nodes go.
>
> Many thanks
>
> Philippa
>
> ________________________________________
> From: Emir Arnautovic <emir.arnauto...@sematext.com>
> Sent: 08 December 2015 10:38
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
>
> Hi Philippa,
> My guess would be that you are running some heavy queries (faceting/deep
> paging/large pages), have a high query load (can you give a bit more
> detail about the load?), or have misconfigured caches. Do you query the
> entire index or do you have query routing?
>
> You have big machines and might consider running two Solr instances on
> each node (each with a smaller heap) and splitting shards, so queries can
> be more parallelized, resources better utilized, and the smaller heaps
> are easier to GC.
>
> Regards,
> Emir
>
> On 08.12.2015 10:49, philippa griggs wrote:
> > Hello Erick,
> >
> > Thanks for your reply.
> >
> > We have one collection and are writing documents to that collection all
> > the time - it peaks at around 2,500 per minute and dips to 250 per
> > minute; the size of the documents varies. On each node we have around
> > 55,000,000 documents with a data size of 43G located on a drive of 200G.
> >
> > Each node has 122G memory; the heap size is currently set at 45G,
> > although we have plans to increase this to 50G.
> >
> > The heap settings we are using are:
> >
> > -XX:+UseG1GC,
> > -XX:+ParallelRefProcEnabled.
> >
> > Please let me know if you need any more information.
> >
> > Philippa
> > ________________________________________
> > From: Erick Erickson <erickerick...@gmail.com>
> > Sent: 07 December 2015 16:53
> > To: solr-user
> > Subject: Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
> >
> > Tell us a bit more.
> >
> > Are you adding documents to your collections or adding more
> > collections? Solr is a balancing act between the number of docs you
> > have on each node and the memory you have allocated. If you're
> > continually adding docs to Solr, you'll eventually run out of memory
> > and/or hit big GC pauses.
> >
> > How much memory are you allocating to Solr? How much physical memory
> > do you have? etc.
> >
> > Best,
> > Erick
> >
> >
> > On Mon, Dec 7, 2015 at 8:37 AM, philippa griggs
> > <philippa.gri...@hotmail.co.uk> wrote:
> >> Hello,
> >>
> >> I'm using:
> >>
> >> Solr 5.2.1, 10 shards each with a replica (20 nodes in total).
> >>
> >> Zookeeper 3.4.6.
> >>
> >> About half a year ago we upgraded to Solr 5.2.1 and since then have
> >> been experiencing a 'wipe out' effect where all of a sudden most, if
> >> not all, nodes will go down. Sometimes they will recover by themselves,
> >> but more often than not we have to step in to restart nodes.
> >>
> >> Nothing in the logs jumps out as being the problem. With the latest
> >> wipe out we noticed that 10 out of the 20 nodes had garbage collections
> >> over 1 min, all at the same time, with heap usage spiking in some cases
> >> to 80%. We also noticed that the number of selects run on the Solr
> >> cluster increased just before the wipe out.
> >>
> >> Increasing the heap size seems to help for a while, but then it starts
> >> happening again - so it's more like a delay than a fix. Our GC settings
> >> are -XX:+UseG1GC, -XX:+ParallelRefProcEnabled.
> >>
> >> With our previous version of Solr (4.10.0) this didn't happen. We had
> >> nodes/shards go down but it was contained; with the new version they
> >> all seem to go at around the same time. We can't really continue just
> >> increasing the heap size and would like to solve this issue rather
> >> than delay it.
> >>
> >> Has anyone experienced something similar?
> >>
> >> Is there a difference between the two versions around the recovery
> >> process?
> >>
> >> Does anyone have any suggestions on a fix?
> >>
> >> Many thanks
> >>
> >> Philippa
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
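One more note on the heap/GC settings quoted above: those flags normally live in bin/solr.in.sh, so if you try Emir's suggestion of two smaller-heap instances per node, the change would look roughly like this (the variable names depend on your Solr install and the values are only illustrative):

SOLR_HEAP="20g"
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled"

Smaller heaps generally keep G1 pauses shorter, and whatever memory you don't give to the JVM stays available to the OS page cache for the index files.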