Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Hi Philippa,

Try taking a heap dump (when heap usage is high) and then using a profiler to look at which objects are taking up most of the memory. I have seen that if you are faceting/sorting on a large number of documents, the fieldCache grows very big and dominates most of the heap. Enabling docValues on the fields you are sorting/faceting on helps.
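For reference, enabling docValues is a schema.xml change along these lines (a sketch; the field name and type here are hypothetical, and the collection must be reindexed after the change):

```xml
<!-- schema.xml: hypothetical string field used for faceting/sorting. -->
<!-- With docValues="true", Solr reads the column-oriented values from -->
<!-- disk via the OS cache instead of uninverting the field into the -->
<!-- on-heap fieldCache at query time. -->
<field name="category" type="string" indexed="true" stored="true"
       docValues="true"/>
```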
Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Phillippa:

You simply cannot continue adding documents and increasing memory, adding more documents and increasing memory, forever — if for no other reason than that you'll eventually hit such large GC pauses that your query performance will suffer greatly.

I'd _strongly_ advise you to pick a number of docs (let's say 50M, but you could make it smaller or larger, up to you) as the maximum number of docs you can put in a shard, then create enough shards to accommodate your eventual total corpus. This may mean "oversharding", where you host multiple shards in the same JVM and then move them to new hardware as the doc load on any particular JVM exceeds 50M (i.e. say 10M docs on each of 5 nodes).

IMO, though, the path you're on is untenable in the long run. You either have to plan for total capacity or prune your corpus.

Best,
Erick
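Back-of-the-envelope for Erick's suggestion: pick a per-shard doc cap and size the shard count from the projected total corpus (a sketch; the cap and corpus numbers below are illustrative, not recommendations):

```python
import math

def shards_needed(projected_docs: int, max_docs_per_shard: int = 50_000_000) -> int:
    """Shards required to keep every shard at or under the chosen doc cap."""
    return math.ceil(projected_docs / max_docs_per_shard)

# e.g. a corpus expected to grow to 1.1 billion documents at 50M docs/shard:
print(shards_needed(1_100_000_000))  # -> 22
```

The point of computing this up front is that the shard count is fixed at collection-creation time (short of shard splitting), so it has to cover the eventual corpus, not the current one.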
Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Hi Philippa,

It's more likely that this is related to index size/content plus queries than to the Solr version. Did you experience issues immediately after the upgrade?

Check the slow queries log and see if there are some extremely slow queries. Check the cache sizes and calculate how much memory they take. Increasing the heap size is not likely to help — it might postpone the issue, but it will hit harder when it does.

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
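If slow requests aren't already being logged, Solr can flag them itself via solrconfig.xml (a sketch; the 1000 ms threshold is an arbitrary starting point, tune it to your latency profile):

```xml
<!-- solrconfig.xml, inside the <query> section: requests that take -->
<!-- longer than this many milliseconds are logged at WARN as slow. -->
<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
```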
Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Hello Emir,

The query load is around 35 requests per minute on each shard; we don't document route, so we query the entire index.

We do have some heavy queries like faceting, and it's possible that a heavy query is causing the nodes to go down; we are looking into this. I'm new to Solr so this could be a slightly stupid question, but would a heavy query cause most of the nodes to go down? This didn't happen with the previous Solr version we were using (4.10.0) — we did have nodes/shards which went down, but there wasn't a wipe-out effect where most of the nodes go.

Many thanks

Philippa
Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Hi Phillippa,

My guess would be that you are running some heavy queries (faceting/deep paging/large pages), have a high query load (can you give a bit more detail about the load?), or have misconfigured caches. Do you query the entire index or do you have query routing?

You have big machines, and might consider running two Solr instances on each node (with smaller heaps) and splitting shards, so that queries can be more parallelized, resources better utilized, and each heap smaller to GC.

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
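On the "misconfigured caches" point, a quick worst-case estimate for the filterCache: each entry can be a bitset of maxDoc bits, so at 55M documents per node the footprint adds up fast (a sketch; 512 entries is a common example default — check the actual size in your solrconfig.xml):

```python
def filter_cache_bytes(max_doc: int, cache_entries: int) -> int:
    """Worst-case filterCache footprint: one bitset of max_doc bits per entry."""
    return cache_entries * (max_doc // 8)

# 55M docs per node and a 512-entry filterCache:
bytes_used = filter_cache_bytes(55_000_000, 512)
print(round(bytes_used / 2**30, 1))  # -> 3.3 (GB)
```

Several gigabytes of long-lived cache entries like this are exactly the kind of old-generation pressure that shows up as the long G1 pauses described in this thread.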
Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Hello Erick,

Thanks for your reply.

We have one collection and are writing documents to it all the time; writes peak at around 2,500 per minute and dip to 250 per minute, and the size of the documents varies. On each node we have around 55,000,000 documents with a data size of 43G, located on a drive of 200G.

Each node has 122G of memory; the heap size is currently set at 45G, although we have plans to increase this to 50G.

The GC settings we are using are:

-XX:+UseG1GC
-XX:+ParallelRefProcEnabled

Please let me know if you need any more information.

Philippa
Re: Solr 5.2.1 Most solr nodes in a cluster going down at once.
Tell us a bit more.

Are you adding documents to your collections or adding more collections? Solr is a balancing act between the number of docs you have on each node and the memory you have allocated. If you're continually adding docs to Solr, you'll eventually run out of memory and/or hit big GC pauses.

How much memory are you allocating to Solr? How much physical memory do you have? Etc.

Best,
Erick
Solr 5.2.1 Most solr nodes in a cluster going down at once.
Hello,

I'm using:

Solr 5.2.1, 10 shards each with a replica (20 nodes in total)
Zookeeper 3.4.6

About half a year ago we upgraded to Solr 5.2.1, and since then we have been experiencing a 'wipe out' effect where all of a sudden most if not all nodes will go down. Sometimes they will recover by themselves, but more often than not we have to step in and restart nodes.

Nothing in the logs jumps out as being the problem. With the latest wipe out we noticed that 10 out of the 20 nodes had garbage collections over 1 min, all at the same time, with the heap usage spiking in some cases to 80%. We also noticed that the number of selects run on the Solr cluster increased just before the wipe out.

Increasing the heap size seems to help for a while, but then it starts happening again, so it's more like a delay than a fix. Our GC settings are -XX:+UseG1GC, -XX:+ParallelRefProcEnabled.

With our previous version of Solr (4.10.0) this didn't happen. We had nodes/shards go down, but it was contained; with the new version they all seem to go at around the same time. We can't really continue just increasing the heap size, and would like to solve this issue rather than delay it.

Has anyone experienced something similar?

Is there a difference between the two versions around the recovery process?

Does anyone have any suggestions on a fix?

Many thanks

Philippa
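For anyone comparing setups: in Solr 5.x these settings would typically live in solr.in.sh (a sketch of the configuration described above; the values shown are the ones from this thread, not recommendations):

```shell
# solr.in.sh (read by bin/solr at startup)
SOLR_HEAP="45g"
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled"
```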