CDCR - Active-passive model
Hello,
I read that work is being done for 6.x to fix the limitation of CDCR only covering the active-passive scenario. My question is: does anyone know when we can expect the fix to be out?
Thanks,
Mads
RE: File Descriptor/Memory Leak
FYI - we're using Solr 6.1.0, and the leak seems to be consistent (it occurs every single time when running with SSL).

-----Original Message-----
From: Anshum Gupta [mailto:ans...@anshumgupta.net]
Sent: torsdag 7. juli 2016 18.14
To: solr-user@lucene.apache.org
Subject: Re: File Descriptor/Memory Leak

I've created a JIRA to track this: https://issues.apache.org/jira/browse/SOLR-9290

On Thu, Jul 7, 2016 at 8:00 AM, Shai Erera <ser...@gmail.com> wrote:
> Shalin, we're seeing that issue too (and are actually actively debugging
> it these days). So far I can confirm the following (on a 2-node cluster):
>
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) After restarting the Solr process (sometimes both need to be restarted),
> the count drops to 0, but if indexing continues, it climbs up again
>
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa; the replica is usually put in the DOWN state and
> there's no way to fix it besides restarting the JVM.
>
> Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that
> looked suspicious (SOLR-8451 and SOLR-8578), even though the changes
> look legit. That did not help, and honestly I had done that before we
> suspected it might be the SSL. Therefore I think those are "safe", but
> just FYI.
>
> When it does happen, the number of CLOSE_WAITs climbs very high, to the
> order of 30K+ entries in 'netstat'.
>
> When I say it does not reproduce on 5.4.1, I really mean the numbers
> don't go as high as they do in 5.5.1. Meaning, when running without
> SSL, the number of CLOSE_WAITs is small, usually less than 10 (I would
> separately like to understand why we have any in that state at all).
> When running with SSL and 5.4.1, they stay low, on the order of
> hundreds at most.
>
> Unfortunately running without SSL is not an option for us. We will
> likely roll back to 5.4.1, even if the problem exists there too, to a
> lesser degree.
>
> I will post back here when/if we have more info about this.
>
> Shai
>
> On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>
>> I have myself seen this CLOSE_WAIT issue at a customer. I am running
>> some tests with different versions trying to pinpoint the cause of this
>> leak. Once I have some more information and a reproducible test, I'll
>> open a jira issue. I'll keep you posted.
>>
>> On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan <m...@dips.no> wrote:
>>
>>> Hello there,
>>> Our SolrCloud is experiencing an FD leak while running with SSL. This
>>> is occurring on the one machine that our program is sending data to.
>>> We have a total of three servers running as an ensemble.
>>>
>>> While running without SSL, the FD count remains quite constant at
>>> around 180 while indexing. Performing a garbage collection also clears
>>> almost the entire JVM memory.
>>>
>>> However - when indexing with SSL, the FDC grows polynomially. The
>>> count increases by a few hundred every five seconds or so, and easily
>>> reaches 50 000 within three to four minutes. Performing a GC sweeps
>>> most of the memory on the two machines our program isn't transmitting
>>> the data directly to. The last machine is unaffected by the GC, and
>>> neither memory nor FDC resets before Solr is restarted on that machine.
>>>
>>> Performing a netstat reveals that the FDC mostly consists of
>>> TCP connections in the "CLOSE_WAIT" state.
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.

--
Anshum Gupta
File Descriptor/Memory Leak
Hello there,
Our SolrCloud is experiencing an FD leak while running with SSL. This is occurring on the one machine that our program is sending data to. We have a total of three servers running as an ensemble.

While running without SSL, the FD count remains quite constant at around 180 while indexing. Performing a garbage collection also clears almost the entire JVM memory.

However - when indexing with SSL, the FDC grows polynomially. The count increases by a few hundred every five seconds or so, and easily reaches 50 000 within three to four minutes. Performing a GC sweeps most of the memory on the two machines our program isn't transmitting the data directly to. The last machine is unaffected by the GC, and neither memory nor FDC resets before Solr is restarted on that machine.

Performing a netstat reveals that the FDC mostly consists of TCP connections in the "CLOSE_WAIT" state.
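When chasing leaks like this, it helps to tally connections by TCP state instead of eyeballing raw netstat output. The following is a generic diagnostic sketch (my own suggestion, not from the thread); the here-doc sample merely stands in for real `netstat -tn` output:

```shell
# Tally TCP connections by state from netstat-style output.
# In practice you would pipe real output: netstat -tn | tally_states
tally_states() {
  # keep only tcp/tcp6 rows, print the last column (the state), then count
  awk '$1 ~ /^tcp/ {print $NF}' | sort | uniq -c | sort -rn
}

# Illustrative sample data only (format mimics netstat -tn on Linux):
tally_states <<'EOF'
tcp6 70 0 10.0.106.115:34531 10.0.106.114:443 CLOSE_WAIT
tcp6 70 0 10.0.106.115:34532 10.0.106.114:443 CLOSE_WAIT
tcp6  0 0 10.0.106.115:8983  10.0.106.116:443 ESTABLISHED
EOF
```

A count in the tens of thousands next to CLOSE_WAIT, as reported in this thread, then stands out immediately.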
RE: Memory issues when indexing
Another update: after creating a new certificate, properly specified for its context of use, we still end up in the described situation. Thus, it seems SSL itself is the underlying reason for the leak.

-----Original Message-----
From: Mads Tomasgård Bjørgan [mailto:m...@dips.no]
Sent: tirsdag 5. juli 2016 10.36
To: solr-user@lucene.apache.org
Subject: RE: Memory issues when indexing

Hi again,
We turned off SSL - and now everything works as normal. The certificate was not originally meant to be used on the current servers, but we would like to keep it, as it has already been deployed and is used by our customers. Thus we need to launch the cloud with "-Dsolr.ssl.checkPeerName=false" - but it seems quite obvious that the nodes still can't communicate properly. Our last resort is to replace the certificate - so the question now is whether it is possible to tweak the configuration so that we can deploy a SolrCloud with the same certificate.

Thanks,
Mads

From: Mads Tomasgård Bjørgan [mailto:m...@dips.no]
Sent: tirsdag 5. juli 2016 09.46
To: solr-user@lucene.apache.org
Subject: Memory issues when indexing

Hello,
We're struggling with memory issues when posting documents to Solr - and are unsure why the problem occurs. The documents are indexed into a SolrCloud running Solr 6.1.0 on top of ZooKeeper 3.4.8, utilizing three VMs running CentOS 7 and JRE 1.8.0. After various attempts with different configurations, the heap always got full on one, and only one, of the machines (let's call this machine 1) - in the end yielding the following exception:

() o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Cannot assign requested address

The remaining two machines always have a lot of free memory compared with machine 1. Thus, we decided to index only a small fraction of the documents to see whether the exception was due to memory limitations or not. We stopped indexing when the memory of machine 1 reached 2.5 GB of a total of 4 GB. As seen in the picture from JConsole, machine 2 was only using 1.4 GB of the available memory at the same time (the same goes for machine 3). After indexing stopped, both machine 2 and 3 had most of their memory freed when performing a garbage collection. However, machine 1 was unaffected: very little memory was freed, meaning Solr still used around 2.5 GB. I would expect the memory of machine 1 to be freed in the same manner as on machines 2 and 3, since indexing had stopped. Most of the memory belonged to the "CMS Old Gen" memory pool (well above 2 GB).

Indexing until the memory is full on machine 1 gives a count of 50 000 in "File Descriptor Count" - while the number of files in the index folder is around 150 for each node. I was told that the number of files in the index folder and the file descriptor count should match?

Machine 1 has an enormous number of TCP connections stalled at CLOSE_WAIT - while machines 2 and 3 don't have the corresponding FIN_WAITs, even though machine 1 has almost all of its TCP connections pointing at those machines.

[cid:image001.png@01D1D6A1.C6B71120][cid:image002.png@01D1D6A1.C6B71120]
JConsole pictures for machine 1 and 2, respectively.

At 08:45 we resumed indexing - the same exception as shown above appeared around 08:52. Machine 2 frees most of its memory at GC - in contrast to machine 1. We have no idea whether this is a bug or a fault in the configuration - and we were hoping someone could help with our problem.

Greetings,
Mads
RE: Memory issues when indexing
Hi again,
We turned off SSL - and now everything works as normal. The certificate was not originally meant to be used on the current servers, but we would like to keep it, as it has already been deployed and is used by our customers. Thus we need to launch the cloud with "-Dsolr.ssl.checkPeerName=false" - but it seems quite obvious that the nodes still can't communicate properly. Our last resort is to replace the certificate - so the question now is whether it is possible to tweak the configuration so that we can deploy a SolrCloud with the same certificate.

Thanks,
Mads

From: Mads Tomasgård Bjørgan [mailto:m...@dips.no]
Sent: tirsdag 5. juli 2016 09.46
To: solr-user@lucene.apache.org
Subject: Memory issues when indexing

Hello,
We're struggling with memory issues when posting documents to Solr - and are unsure why the problem occurs. The documents are indexed into a SolrCloud running Solr 6.1.0 on top of ZooKeeper 3.4.8, utilizing three VMs running CentOS 7 and JRE 1.8.0. After various attempts with different configurations, the heap always got full on one, and only one, of the machines (let's call this machine 1) - in the end yielding the following exception:

() o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Cannot assign requested address

The remaining two machines always have a lot of free memory compared with machine 1. Thus, we decided to index only a small fraction of the documents to see whether the exception was due to memory limitations or not. We stopped indexing when the memory of machine 1 reached 2.5 GB of a total of 4 GB. As seen in the picture from JConsole, machine 2 was only using 1.4 GB of the available memory at the same time (the same goes for machine 3). After indexing stopped, both machine 2 and 3 had most of their memory freed when performing a garbage collection. However, machine 1 was unaffected: very little memory was freed, meaning Solr still used around 2.5 GB. I would expect the memory of machine 1 to be freed in the same manner as on machines 2 and 3, since indexing had stopped. Most of the memory belonged to the "CMS Old Gen" memory pool (well above 2 GB).

Indexing until the memory is full on machine 1 gives a count of 50 000 in "File Descriptor Count" - while the number of files in the index folder is around 150 for each node. I was told that the number of files in the index folder and the file descriptor count should match?

Machine 1 has an enormous number of TCP connections stalled at CLOSE_WAIT - while machines 2 and 3 don't have the corresponding FIN_WAITs, even though machine 1 has almost all of its TCP connections pointing at those machines.

[cid:image001.png@01D1D6A1.C6B71120][cid:image002.png@01D1D6A1.C6B71120]
JConsole pictures for machine 1 and 2, respectively.

At 08:45 we resumed indexing - the same exception as shown above appeared around 08:52. Machine 2 frees most of its memory at GC - in contrast to machine 1. We have no idea whether this is a bug or a fault in the configuration - and we were hoping someone could help with our problem.

Greetings,
Mads
Memory issues when indexing
Hello,
We're struggling with memory issues when posting documents to Solr - and are unsure why the problem occurs. The documents are indexed into a SolrCloud running Solr 6.1.0 on top of ZooKeeper 3.4.8, utilizing three VMs running CentOS 7 and JRE 1.8.0. After various attempts with different configurations, the heap always got full on one, and only one, of the machines (let's call this machine 1) - in the end yielding the following exception:

() o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Cannot assign requested address

The remaining two machines always have a lot of free memory compared with machine 1. Thus, we decided to index only a small fraction of the documents to see whether the exception was due to memory limitations or not. We stopped indexing when the memory of machine 1 reached 2.5 GB of a total of 4 GB. As seen in the picture from JConsole, machine 2 was only using 1.4 GB of the available memory at the same time (the same goes for machine 3). After indexing stopped, both machine 2 and 3 had most of their memory freed when performing a garbage collection. However, machine 1 was unaffected: very little memory was freed, meaning Solr still used around 2.5 GB. I would expect the memory of machine 1 to be freed in the same manner as on machines 2 and 3, since indexing had stopped. Most of the memory belonged to the "CMS Old Gen" memory pool (well above 2 GB).

Indexing until the memory is full on machine 1 gives a count of 50 000 in "File Descriptor Count" - while the number of files in the index folder is around 150 for each node. I was told that the number of files in the index folder and the file descriptor count should match?

Machine 1 has an enormous number of TCP connections stalled at CLOSE_WAIT - while machines 2 and 3 don't have the corresponding FIN_WAITs, even though machine 1 has almost all of its TCP connections pointing at those machines.

[cid:image001.png@01D1D6A1.C6B71120][cid:image002.png@01D1D6A1.C6B71120]
JConsole pictures for machine 1 and 2, respectively.

At 08:45 we resumed indexing - the same exception as shown above appeared around 08:52. Machine 2 frees most of its memory at GC - in contrast to machine 1. We have no idea whether this is a bug or a fault in the configuration - and we were hoping someone could help with our problem.

Greetings,
Mads
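For tracking the "File Descriptor Count" outside JConsole, the kernel's `/proc` filesystem gives the same number directly on CentOS 7. This is a generic sketch (not from the thread); how you find the Solr PID depends on how Solr was started:

```shell
# Count the open file descriptors of a process by listing /proc/<pid>/fd.
fd_count() {
  ls "/proc/$1/fd" | wc -l
}

# Demo: count this shell's own FDs. For Solr, substitute its PID,
# e.g. fd_count "$(pgrep -f start.jar)" (assumes a stock Solr launch).
fd_count $$

# To sample the trend every 5 seconds (replace <solr-pid>):
# while true; do echo "$(date +%T) $(fd_count <solr-pid>)"; sleep 5; done
```

A count that climbs steadily while the index folder stays at ~150 files, as described above, points at leaked sockets rather than index files.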
RE: Solr node crashes while indexing - Too many open files
That's true, but I was hoping there would be another way to solve this issue, as it's not considered preferable in our situation. Is it normal behavior for Solr to open over 4000 files without closing them properly? Is it, for example, possible to adjust the autoCommit settings in solrconfig.xml to force Solr to close the files? Any help is appreciated :-)

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: torsdag 30. juni 2016 11.41
To: solr-user@lucene.apache.org
Subject: RE: Solr node crashes while indexing - Too many open files

Mads, some distributions require different steps for increasing max_open_files. Check how it works for CentOS specifically.
Markus

-----Original message-----
> From: Mads Tomasgård Bjørgan
> Sent: Thursday 30th June 2016 10:52
> To: solr-user@lucene.apache.org
> Subject: Solr node crashes while indexing - Too many open files
>
> Hello,
> We're indexing a large set of files using Solr 6.1.0, running a SolrCloud
> utilizing ZooKeeper 3.4.8.
>
> We have two ensembles - and both clusters are running on three of their
> own respective VMs (CentOS 7). We first thought the error was due to
> CDCR, as we were trying to index a large number of documents which had
> to be replicated to the target cluster. However, we got the same error
> even after turning off CDCR - which indicates CDCR wasn't the problem
> after all.
>
> After indexing between 20 000 and 35 000 documents to the source
> cluster, the File Descriptor Count reaches 4096 for one of the Solr
> nodes - and the respective node crashes. The count grows quite linearly
> over time. The remaining 2 nodes in the cluster are not affected at all,
> and their logs had no relevant entries. We found the following errors
> for the crashing node in its log:
>
> 2016-06-30 08:23:12.459 ERROR (updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error
> java.net.SocketException: Too many open files
> (...)
> 2016-06-30 08:23:12.460 ERROR (updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error
> java.net.SocketException: Too many open files
> (...)
> 2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update:
> Too many open files
> Too many open files
> (...)
> 2016-06-30 08:23:12.461 INFO (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1] webapp=/solr path=/update params={version=2.2} status=-1 QTime=5
> 2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update:
> Too many open files
> Too many open files
> (...)
> 2016-06-30 08:23:12.461 WARN (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1
> 2016-06-30 08:23:38.108 INFO (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1] webapp=/solr path=/select params={df=_text_=false=id=score=4=0=true=https://10.0.106.115:443/solr/DIPS_shard1_replica1/=10=2=*:*=1467275018057=true=javabin&_=1467275017220} hits=30218 status=0 QTime=1
>
> Running netstat -n -p on the VM that yields the exceptions reveals that
> there are at least 1 800 TCP connections (we didn't count exactly how
> many - the netstat output filled the entire PuTTY window, yielding 2 000
> lines) waiting to be closed:
> tcp6 70 0 10.0.106.115:34531 10.0.106.114:443 CLOSE_WAIT 21658/java
> We're running the SolrCloud on port 443, and the IPs belong to the VMs.
> We also tried adjusting the ulimit for the machine to 100 000 - without
> any results.
>
> Greetings,
> Mads
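On Markus's point that the steps are distribution-specific: on CentOS 7, if Solr runs as a systemd service, `/etc/security/limits.conf` (and a `ulimit` set in a login shell) will not affect it; the service needs its own `LimitNOFILE`. A sketch, assuming the unit is named `solr` (check with `systemctl list-units`):

```shell
# Check the limit visible to the current shell (the Solr process may
# see a different value if it runs under systemd):
ulimit -n

# For a systemd-managed Solr on CentOS 7, add a unit drop-in instead of
# editing /etc/security/limits.conf, which systemd services ignore.
# (Commands left as comments; "solr" is an assumed unit name.)
#
#   sudo mkdir -p /etc/systemd/system/solr.service.d
#   printf '[Service]\nLimitNOFILE=65000\n' | \
#     sudo tee /etc/systemd/system/solr.service.d/limits.conf
#   sudo systemctl daemon-reload && sudo systemctl restart solr
#
# Verify what the running process actually got:
#   cat /proc/<solr-pid>/limits | grep 'open files'
```

Note that raising the limit only buys time here: with sockets stuck in CLOSE_WAIT, any ceiling will eventually be hit.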
Solr node crashes while indexing - Too many open files
Hello,
We're indexing a large set of files using Solr 6.1.0, running a SolrCloud utilizing ZooKeeper 3.4.8.

We have two ensembles - and both clusters are running on three of their own respective VMs (CentOS 7). We first thought the error was due to CDCR, as we were trying to index a large number of documents which had to be replicated to the target cluster. However, we got the same error even after turning off CDCR - which indicates CDCR wasn't the problem after all.

After indexing between 20 000 and 35 000 documents to the source cluster, the File Descriptor Count reaches 4096 for one of the Solr nodes - and the respective node crashes. The count grows quite linearly over time. The remaining 2 nodes in the cluster are not affected at all, and their logs had no relevant entries. We found the following errors for the crashing node in its log:

2016-06-30 08:23:12.459 ERROR (updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
(...)
2016-06-30 08:23:12.460 ERROR (updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
(...)
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update:
Too many open files
Too many open files
(...)
2016-06-30 08:23:12.461 INFO (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1] webapp=/solr path=/update params={version=2.2} status=-1 QTime=5
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update:
Too many open files
Too many open files
(...)
2016-06-30 08:23:12.461 WARN (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1
2016-06-30 08:23:38.108 INFO (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1] webapp=/solr path=/select params={df=_text_=false=id=score=4=0=true=https://10.0.106.115:443/solr/DIPS_shard1_replica1/=10=2=*:*=1467275018057=true=javabin&_=1467275017220} hits=30218 status=0 QTime=1

Running netstat -n -p on the VM that yields the exceptions reveals that there are at least 1 800 TCP connections (we didn't count exactly how many - the netstat output filled the entire PuTTY window, yielding 2 000 lines) waiting to be closed:

tcp6 70 0 10.0.106.115:34531 10.0.106.114:443 CLOSE_WAIT 21658/java

We're running the SolrCloud on port 443, and the IPs belong to the VMs. We also tried adjusting the ulimit for the machine to 100 000 - without any results.

Greetings,
Mads
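On the autoCommit idea raised in the follow-up to this message: a hard autoCommit with `openSearcher=false` bounds how long uncommitted segments and transaction-log files stay open, which can keep index-related descriptors down - though note that the CLOSE_WAIT entries seen in this thread are network sockets, which commit settings do not close. A hedged sketch of the relevant `solrconfig.xml` fragment (the values are illustrative only, not a recommendation):

```xml
<!-- solrconfig.xml: inside <updateHandler>. Values are illustrative. -->
<autoCommit>
  <maxTime>60000</maxTime>             <!-- hard commit at most every 60 s -->
  <maxDocs>25000</maxDocs>             <!-- ...or every 25 000 documents -->
  <openSearcher>false</openSearcher>   <!-- flush to disk without reopening searchers -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>              <!-- soft commit for search visibility -->
</autoSoftCommit>
```

If the descriptor growth persists with aggressive hard commits, that again points away from index files and toward the leaked TCP connections.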