Re: File Descriptor/Memory Leak
If this is reproducible, I would run the comparison under Wireshark (formerly called Ethereal): https://www.wireshark.org/. It captures full network traffic and can even be run on a machine separate from both client and server (in promiscuous mode).

Then I would compare the number of connections between HTTP and HTTPS for the same test. Perhaps HTTP is doing request pipelining and HTTPS is not. That would lead to more sockets (and more CLOSE_WAITs) for the same content.

If the number of connections is the same, I would pick a similar transaction in each capture and look at the delays in the closing sequence (the FIN/ACK packets). If, after the server sends its closing packet, the client does not reply as quickly with its own closing packet under HTTPS, then the problem is in the socket-closing code.

Obviously, establishing an SSL connection is more painful/expensive than establishing a non-SSL one, but the issue here is the closing of one.

This is how I troubleshot these scenarios many years ago as WebLogic senior tech support. I still think approaching this from the network up is the most viable approach.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

On 10 July 2016 at 17:05, Shai Erera <ser...@gmail.com> wrote:
> There is no firewall and the CLOSE_WAITs are between Solr-to-Solr nodes
> (the origin and destination IP:PORT belong to Solr).
>
> Also, note that the same test runs fine on 5.4.1, even though there are
> still a few hundred CLOSE_WAITs. I'm looking at what has changed in the
> code between 5.4.1 and 5.5.1. It's also only reproducible when Solr is run
> in SSL mode, so the problem might lie in HttpClient/Jetty too.
>
> Shai
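The close-delay comparison described in this message can be sketched in a few lines of Python. This is only an illustration: the `(timestamp, src, dst, flags)` tuple format is a hypothetical stand-in for rows exported from a capture, and `fin_delays` is a made-up helper name, not part of Wireshark or Solr.

```python
from collections import defaultdict


def fin_delays(packets):
    """Measure how long each connection waits between the two FINs.

    `packets` is an iterable of (timestamp, src, dst, flags) tuples, a
    hypothetical format: `src` and `dst` are "ip:port" strings and
    `flags` is a set such as {"FIN", "ACK"}.

    Returns (delays, unanswered):
      delays     - {connection: seconds between the first FIN of each side}
      unanswered - connections where only one side has sent a FIN so far
    """
    first_fin = defaultdict(dict)  # connection -> {sender: timestamp}
    for ts, src, dst, flags in packets:
        if "FIN" not in flags:
            continue
        conn = tuple(sorted((src, dst)))  # same key for both directions
        first_fin[conn].setdefault(src, ts)

    delays, unanswered = {}, []
    for conn, fins in first_fin.items():
        if len(fins) == 2:
            a, b = fins.values()
            delays[conn] = abs(a - b)
        else:
            unanswered.append(conn)
    return delays, unanswered
```

Running this over an HTTP capture and an HTTPS capture of the same test would show whether the second FIN arrives later, or never, under HTTPS. Connections listed in `unanswered` are the ones where one side never sent its own closing packet.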
Re: File Descriptor/Memory Leak
There is no firewall and the CLOSE_WAITs are between Solr-to-Solr nodes (the origin and destination IP:PORT belong to Solr).

Also, note that the same test runs fine on 5.4.1, even though there are still a few hundred CLOSE_WAITs. I'm looking at what has changed in the code between 5.4.1 and 5.5.1. It's also only reproducible when Solr is run in SSL mode, so the problem might lie in HttpClient/Jetty too.

Shai

On Fri, Jul 8, 2016 at 11:59 AM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Is there a firewall between the client and the server, by any chance?
>
> CLOSE_WAIT is not in itself a leak; it is a standard TCP state at the end
> of a connection. So the question is why sockets are being reopened that
> often, or why the other side does not acknowledge the TCP termination
> packet quickly.
>
> I would run Ethereal to troubleshoot that. And truss/strace.
>
> Regards,
>     Alex
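One way to double-check that the CLOSE_WAIT sockets really sit between Solr nodes is to group them by endpoint. The sketch below assumes Linux `netstat -tan` style output (columns Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and assumes Solr listens on its default port 8983. The function names are illustrative, not an existing tool.

```python
from collections import Counter


def _endpoint(addr, service_ports):
    """Keep known service ports, replace ephemeral ports with '*'."""
    host, _, port = addr.rpartition(":")
    return f"{host}:{port}" if port in service_ports else f"{host}:*"


def close_wait_by_peer(netstat_text, service_ports=("8983",)):
    """Count CLOSE_WAIT sockets per (local, remote) endpoint pair.

    `service_ports` defaults to Solr's default port, which is an
    assumption about the cluster configuration.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "CLOSE_WAIT"):
            local = _endpoint(fields[3], service_ports)
            remote = _endpoint(fields[4], service_ports)
            counts[(local, remote)] += 1
    return counts
```

If every key has a Solr node on both sides, with the Solr port on the remote side, the sockets stuck in CLOSE_WAIT belong to the node acting as the HTTP client in inter-node requests.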
RE: File Descriptor/Memory Leak
Is there a firewall between the client and the server, by any chance?

CLOSE_WAIT is not in itself a leak; it is a standard TCP state at the end of a connection. So the question is why sockets are being reopened that often, or why the other side does not acknowledge the TCP termination packet quickly.

I would run Ethereal to troubleshoot that. And truss/strace.

Regards,
    Alex

On 8 Jul 2016 4:56 PM, "Mads Tomasgård Bjørgan" <m...@dips.no> wrote:
> FYI - we're using Solr-6.1.0, and the leak seems to be consistent (occurs
> every single time when running with SSL).
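For readers unfamiliar with the state: a socket enters CLOSE_WAIT when the peer has sent its FIN and the local application has not yet called close(). The following is a small self-contained sketch using only the Python standard library on the loopback interface. It is unrelated to Solr's own code, and `demo_close_wait` is a made-up name.

```python
import socket


def demo_close_wait():
    """Show that a socket keeps its descriptor until close() is called."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)

    client.close()                 # the peer sends its FIN
    data = server_side.recv(1024)  # b"" means the peer's FIN has arrived

    # From here until close() below, server_side is in CLOSE_WAIT and
    # still holds a file descriptor.
    fd_before = server_side.fileno()
    server_side.close()            # our own FIN goes out, descriptor freed
    fd_after = server_side.fileno()

    listener.close()
    return data, fd_before, fd_after
```

If the application never reaches the `close()` call, the socket stays in CLOSE_WAIT and its descriptor stays allocated. That is how a normal TCP state can still add up to a file descriptor leak.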
RE: File Descriptor/Memory Leak
FYI - we're using Solr-6.1.0, and the leak seems to be consistent (occurs every single time when running with SSL).

-----Original Message-----
From: Anshum Gupta [mailto:ans...@anshumgupta.net]
Sent: Thursday, 7 July 2016 18:14
To: solr-user@lucene.apache.org
Subject: Re: File Descriptor/Memory Leak

I've created a JIRA to track this:
https://issues.apache.org/jira/browse/SOLR-9290
Re: File Descriptor/Memory Leak
I've created a JIRA to track this:
https://issues.apache.org/jira/browse/SOLR-9290

On Thu, Jul 7, 2016 at 8:00 AM, Shai Erera wrote:
> Shalin, we're seeing that issue too (and actually actively debugging it
> these days). So far I can confirm the following (on a 2-node cluster):
>
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) After restarting the Solr process (sometimes both need to be
>    restarted), the count drops to 0, but if indexing continues, it climbs
>    again
>
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa; the replica is usually put in the DOWN state, and
> there's no way to fix it besides restarting the JVM.
>
> Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that
> looked suspicious (SOLR-8451 and SOLR-8578), even though the changes look
> legit. That did not help, and honestly I did that before we suspected
> it might be the SSL. Therefore I think those are "safe", but just FYI.
>
> When it does happen, the number of CLOSE_WAITs climbs very high, on the
> order of 30K+ entries in 'netstat'.
>
> When I say it does not reproduce on 5.4.1, I really mean the numbers don't
> go as high as they do in 5.5.1. Meaning, when running without SSL, the
> number of CLOSE_WAITs is smallish, usually fewer than 10 (I would
> separately like to understand why we have any in that state at all). When
> running with SSL and 5.4.1, they stay low, on the order of hundreds at
> most.
>
> Unfortunately, running without SSL is not an option for us. We will likely
> roll back to 5.4.1, even if the problem exists there, but to a lesser
> degree.
>
> I will post back here when/if we have more info about this.
>
> Shai

--
Anshum Gupta
Re: File Descriptor/Memory Leak
Shalin, we're seeing that issue too (and actually actively debugging it these days). So far I can confirm the following (on a 2-node cluster):

1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
2) It does not reproduce when SSL is disabled
3) After restarting the Solr process (sometimes both need to be restarted), the count drops to 0, but if indexing continues, it climbs again

When it does happen, Solr seems stuck. The leader cannot talk to the replica, or vice versa; the replica is usually put in the DOWN state, and there's no way to fix it besides restarting the JVM.

Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that looked suspicious (SOLR-8451 and SOLR-8578), even though the changes look legit. That did not help, and honestly I did that before we suspected it might be the SSL. Therefore I think those are "safe", but just FYI.

When it does happen, the number of CLOSE_WAITs climbs very high, on the order of 30K+ entries in 'netstat'.

When I say it does not reproduce on 5.4.1, I really mean the numbers don't go as high as they do in 5.5.1. Meaning, when running without SSL, the number of CLOSE_WAITs is smallish, usually fewer than 10 (I would separately like to understand why we have any in that state at all). When running with SSL and 5.4.1, they stay low, on the order of hundreds at most.

Unfortunately, running without SSL is not an option for us. We will likely roll back to 5.4.1, even if the problem exists there, but to a lesser degree.

I will post back here when/if we have more info about this.

Shai

On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar wrote:
> I have myself seen this CLOSE_WAIT issue at a customer. I am running some
> tests with different versions, trying to pinpoint the cause of this leak.
> Once I have some more information and a reproducible test, I'll open a
> JIRA issue. I'll keep you posted.
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: File Descriptor/Memory Leak
I have myself seen this CLOSE_WAIT issue at a customer. I am running some tests with different versions, trying to pinpoint the cause of this leak. Once I have some more information and a reproducible test, I'll open a JIRA issue. I'll keep you posted.

On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan wrote:
> Hello there,
> Our SolrCloud is experiencing an FD leak while running with SSL. This is
> occurring on the one machine that our program is sending data to. We have
> a total of three servers running as an ensemble.
>
> While running without SSL, the FD count remains quite constant at
> around 180 while indexing. Performing a garbage collection also clears
> almost the entire JVM memory.
>
> However, when indexing with SSL the FD count grows polynomially. The count
> increases by a few hundred every five seconds or so, but easily reaches
> 50 000 within three to four minutes. Performing a GC sweeps most of the
> memory on the two machines our program isn't transmitting the data directly
> to. The last machine is unaffected by the GC, and neither memory nor FD
> count resets before Solr is restarted on that machine.
>
> Performing a netstat reveals that the open FDs mostly consist of
> TCP connections in the "CLOSE_WAIT" state.

--
Regards,
Shalin Shekhar Mangar.
File Descriptor/Memory Leak
Hello there,

Our SolrCloud is experiencing an FD leak while running with SSL. This is occurring on the one machine that our program is sending data to. We have a total of three servers running as an ensemble.

While running without SSL, the FD count remains quite constant at around 180 while indexing. Performing a garbage collection also clears almost the entire JVM memory.

However, when indexing with SSL the FD count grows polynomially. The count increases by a few hundred every five seconds or so, but easily reaches 50 000 within three to four minutes. Performing a GC sweeps most of the memory on the two machines our program isn't transmitting the data directly to. The last machine is unaffected by the GC, and neither memory nor FD count resets before Solr is restarted on that machine.

Performing a netstat reveals that the open FDs mostly consist of TCP connections in the "CLOSE_WAIT" state.
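For anyone wanting to reproduce the measurement, the FD count of a running process can be sampled on Linux by listing `/proc/<pid>/fd`. The sketch below assumes a Linux host and permission to read that directory for the Solr process. The function names, sample count and interval are illustrative choices.

```python
import os
import time


def count_open_fds(pid="self"):
    """Number of open file descriptors of a process (Linux /proc only).

    With pid="self" the count includes the descriptor used to list the
    directory, so compare counts with each other rather than reading
    them as exact values.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(pid="self", samples=5, interval=5.0):
    """Collect `samples` FD counts, `interval` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        if i < samples - 1:
            time.sleep(interval)
    return counts
```

Differences between consecutive samples give the growth per interval. Increments that themselves keep rising would be consistent with the faster-than-linear growth described above.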