CDCR - Active-passive model

2016-08-03 Thread Mads Tomasgård Bjørgan
Hello,
I read that work is in progress on 6.x to fix the limitation of CDCR only 
covering the active-passive scenario. My question is: does anyone know when we 
can expect the fix to be out?

Thanks,
Mads


RE: File Descriptor/Memory Leak

2016-07-08 Thread Mads Tomasgård Bjørgan
FYI - we're using Solr 6.1.0, and the leak appears to be consistent (it occurs 
every single time we run with SSL).

-Original Message-
From: Anshum Gupta [mailto:ans...@anshumgupta.net] 
Sent: torsdag 7. juli 2016 18.14
To: solr-user@lucene.apache.org
Subject: Re: File Descriptor/Memory Leak

I've created a JIRA to track this:
https://issues.apache.org/jira/browse/SOLR-9290

On Thu, Jul 7, 2016 at 8:00 AM, Shai Erera <ser...@gmail.com> wrote:

> Shalin, we're seeing that issue too (and are actively debugging 
> it these days). So far I can confirm the following (on a 2-node cluster):
>
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 
> 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted) 
> drops the count to 0, but if indexing continues, it climbs up again
>
> When it does happen, Solr seems stuck. The leader cannot talk to the 
> replica (or vice versa), the replica is usually put in the DOWN state, and 
> there's no way to fix it besides restarting the JVM.
>
> Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that 
> looked suspicious (SOLR-8451 and SOLR-8578), even though the changes 
> look legit. That did not help, and honestly I had done that before we 
> suspected it might be the SSL. Therefore I think those are "safe", but just 
> FYI.
>
> When it does happen, the number of CLOSE_WAITs climbs very high, on the 
> order of 30K+ entries in 'netstat'.
>
> When I say it does not reproduce on 5.4.1, I really mean the numbers 
> don't go as high as they do in 5.5.1. That is, when running without 
> SSL, the number of CLOSE_WAITs is smallish, usually less than 10 (I 
> would separately like to understand why we have any in that state at 
> all). When running with SSL and 5.4.1, they stay low, on the order of 
> hundreds at most.
>
> Unfortunately, running without SSL is not an option for us. We will 
> likely roll back to 5.4.1, where the problem also exists, but to a 
> lesser degree.
>
> I will post back here when/if we have more info about this.
>
> Shai
>
> On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar < 
> shalinman...@gmail.com>
> wrote:
>
> > I have myself seen this CLOSE_WAIT issue at a customer. I am running 
> > some tests with different versions, trying to pinpoint the cause of this 
> > leak. Once I have some more information and a reproducible test, I'll 
> > open a JIRA issue. I'll keep you posted.
> >
> > On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan <m...@dips.no>
> > wrote:
> >
> > > Hello there,
> > > Our SolrCloud is experiencing an FD leak while running with SSL. 
> > > This is occurring on the one machine that our program is sending 
> > > data to. We have a total of three servers running as an ensemble.
> > >
> > > While running without SSL, the FD count remains quite constant 
> > > at around 180 while indexing. Performing a garbage collection also 
> > > clears almost the entire JVM memory.
> > >
> > > However - when indexing with SSL, the FDC grows polynomially. The 
> > > count increases by a few hundred every five seconds or so, and easily 
> > > reaches 50 000 within three to four minutes. Performing a GC clears 
> > > most of the memory on the two machines our program isn't transmitting 
> > > the data directly to. The last machine is unaffected by the GC, and 
> > > neither memory nor FDC resets before Solr is restarted on that machine.
> > >
> > > Performing a netstat reveals that the FDC mostly consists of 
> > > TCP connections in the CLOSE_WAIT state.
> > >
> > >
> > >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>



--
Anshum Gupta


File Descriptor/Memory Leak

2016-07-07 Thread Mads Tomasgård Bjørgan
Hello there,
Our SolrCloud is experiencing an FD leak while running with SSL. This is 
occurring on the one machine that our program is sending data to. We have a 
total of three servers running as an ensemble.

While running without SSL, the FD count remains quite constant at around 180 
while indexing. Performing a garbage collection also clears almost the entire 
JVM memory.

However - when indexing with SSL, the FDC grows polynomially. The count 
increases by a few hundred every five seconds or so, and easily reaches 50 000 
within three to four minutes. Performing a GC clears most of the memory on the 
two machines our program isn't transmitting the data directly to. The last 
machine is unaffected by the GC, and neither memory nor FDC resets before Solr 
is restarted on that machine.

Performing a netstat reveals that the FDC mostly consists of TCP connections 
in the CLOSE_WAIT state.




RE: Memory issues when indexing

2016-07-05 Thread Mads Tomasgård Bjørgan
Another update:

After creating a new certificate, properly specified for its context of use, 
we still end up in the described situation. Thus, it seems SSL itself is the 
underlying reason for the leak -

-Original Message-
From: Mads Tomasgård Bjørgan [mailto:m...@dips.no] 
Sent: tirsdag 5. juli 2016 10.36
To: solr-user@lucene.apache.org
Subject: RE: Memory issues when indexing

Hi again,
We turned off SSL - and now everything works as normal.

The certificate was not originally meant to be used on the current servers, 
but we would like to keep it, as it has already been deployed and is used by 
our customers. Thus we need to launch the cloud with 
"-Dsolr.ssl.checkPeerName=false" - but it seems quite obvious that the nodes 
still can't communicate properly.

Our last resort is to replace the certificate - so the question is now whether 
it is possible to tweak the configuration so that we can deploy a SolrCloud 
with the same certificate.

Thanks,
Mads

From: Mads Tomasgård Bjørgan [mailto:m...@dips.no]
Sent: tirsdag 5. juli 2016 09.46
To: solr-user@lucene.apache.org
Subject: Memory issues when indexing

Hello,
We're struggling with memory issues when posting documents to Solr, and are 
unsure why the problem occurs.

The documents are indexed in a SolrCloud running Solr 6.1.0 on top of Zookeeper 
3.4.8, utilizing three VMs running CentOS 7 and JRE 1.8.0.

After various attempts with different configurations, the heap always got full 
on one, and only one, of the machines (let's call this machine 1) - in the end 
yielding the following exception:
() o.a.s.s.HttpSolrCall 
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 Async exception during distributed update: Cannot assign requested address
The remaining two machines always have a lot of free memory compared with 
machine 1.

Thus, we decided to index only a small fraction of the documents to see whether 
the exception was due to memory limitations or not. We stopped the indexing 
when the memory of machine 1 reached 2.5GB of a total of 4GB. As seen in the 
JConsole screenshot, machine 2 was only using 1.4GB of the available memory at 
the same time (same goes for machine 3). Once indexing stopped, both machine 2 
and 3 had most of their memory emptied by a garbage collection. However - 
machine 1 was unaffected, and very little memory was freed, which means Solr 
still used around 2.5GB of the memory. I would have expected the memory of 
machine 1 to be emptied in a similar manner as machines 2 and 3 once indexing 
was stopped. Most of the memory belonged to the "CMS Old Gen" memory pool 
(well above 2GB).

Indexing until the memory is full on machine 1 gives a count of 50 000 in 
"File Descriptor Count" - while the number of files in the index folder is 
around 150 for each node. I was told that the number of files in the index 
folder and the file descriptor count should match? Machine 1 has an enormous 
number of TCP connections stalled in CLOSE_WAIT - while machines 2 and 3 don't 
have the corresponding FIN_WAITs, even though machine 1 has almost all of its 
TCP connections pointing at those machines.

[JConsole screenshots of machine 1 and machine 2, respectively.]
At 08:45 we resumed indexing - the same exception as shown above appeared 
around 08:52. Machine 2 clears most of its memory at GC - in contrast to 
machine 1.


We have no idea whether this is a bug or a fault in the configuration, and 
were hoping someone could help us with this problem.

Greetings,
Mads


RE: Memory issues when indexing

2016-07-05 Thread Mads Tomasgård Bjørgan
Hi again,
We turned off SSL - and now everything works as normal.

The certificate was not originally meant to be used on the current servers, 
but we would like to keep it, as it has already been deployed and is used by 
our customers. Thus we need to launch the cloud with 
"-Dsolr.ssl.checkPeerName=false" - but it seems quite obvious that the nodes 
still can't communicate properly.

Our last resort is to replace the certificate - so the question is now whether 
it is possible to tweak the configuration so that we can deploy a SolrCloud 
with the same certificate.
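
For context, this is roughly how the nodes are launched - a minimal sketch 
based on the stock bin/solr.in.sh from Solr 6.x, with placeholder paths and 
passwords rather than our real values:

  SOLR_SSL_KEY_STORE=/opt/solr/etc/solr-ssl.keystore.jks
  SOLR_SSL_KEY_STORE_PASSWORD=changeit
  SOLR_SSL_TRUST_STORE=/opt/solr/etc/solr-ssl.keystore.jks
  SOLR_SSL_TRUST_STORE_PASSWORD=changeit
  # hostname verification off, since the certificate was issued for other hosts
  SOLR_OPTS="$SOLR_OPTS -Dsolr.ssl.checkPeerName=false"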

Thanks,
Mads

From: Mads Tomasgård Bjørgan [mailto:m...@dips.no]
Sent: tirsdag 5. juli 2016 09.46
To: solr-user@lucene.apache.org
Subject: Memory issues when indexing

Hello,
We're struggling with memory issues when posting documents to Solr, and are 
unsure why the problem occurs.

The documents are indexed in a SolrCloud running Solr 6.1.0 on top of Zookeeper 
3.4.8, utilizing three VMs running CentOS 7 and JRE 1.8.0.

After various attempts with different configurations, the heap always got full 
on one, and only one, of the machines (let's call this machine 1) - in the end 
yielding the following exception:
() o.a.s.s.HttpSolrCall 
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 Async exception during distributed update: Cannot assign requested address
The remaining two machines always have a lot of free memory compared with 
machine 1.

Thus, we decided to index only a small fraction of the documents to see whether 
the exception was due to memory limitations or not. We stopped the indexing 
when the memory of machine 1 reached 2.5GB of a total of 4GB. As seen in the 
JConsole screenshot, machine 2 was only using 1.4GB of the available memory at 
the same time (same goes for machine 3). Once indexing stopped, both machine 2 
and 3 had most of their memory emptied by a garbage collection. However - 
machine 1 was unaffected, and very little memory was freed, which means Solr 
still used around 2.5GB of the memory. I would have expected the memory of 
machine 1 to be emptied in a similar manner as machines 2 and 3 once indexing 
was stopped. Most of the memory belonged to the "CMS Old Gen" memory pool 
(well above 2GB).

Indexing until the memory is full on machine 1 gives a count of 50 000 in 
"File Descriptor Count" - while the number of files in the index folder is 
around 150 for each node. I was told that the number of files in the index 
folder and the file descriptor count should match? Machine 1 has an enormous 
number of TCP connections stalled in CLOSE_WAIT - while machines 2 and 3 don't 
have the corresponding FIN_WAITs, even though machine 1 has almost all of its 
TCP connections pointing at those machines.

[JConsole screenshots of machine 1 and machine 2, respectively.]
At 08:45 we resumed indexing - the same exception as shown above appeared 
around 08:52. Machine 2 clears most of its memory at GC - in contrast to 
machine 1.


We have no idea whether this is a bug or a fault in the configuration, and 
were hoping someone could help us with this problem.

Greetings,
Mads


Memory issues when indexing

2016-07-05 Thread Mads Tomasgård Bjørgan
Hello,
We're struggling with memory issues when posting documents to Solr, and are 
unsure why the problem occurs.

The documents are indexed in a SolrCloud running Solr 6.1.0 on top of Zookeeper 
3.4.8, utilizing three VMs running CentOS 7 and JRE 1.8.0.

After various attempts with different configurations, the heap always got full 
on one, and only one, of the machines (let's call this machine 1) - in the end 
yielding the following exception:
() o.a.s.s.HttpSolrCall 
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 Async exception during distributed update: Cannot assign requested address
The remaining two machines always have a lot of free memory compared with 
machine 1.

Thus, we decided to index only a small fraction of the documents to see whether 
the exception was due to memory limitations or not. We stopped the indexing 
when the memory of machine 1 reached 2.5GB of a total of 4GB. As seen in the 
JConsole screenshot, machine 2 was only using 1.4GB of the available memory at 
the same time (same goes for machine 3). Once indexing stopped, both machine 2 
and 3 had most of their memory emptied by a garbage collection. However - 
machine 1 was unaffected, and very little memory was freed, which means Solr 
still used around 2.5GB of the memory. I would have expected the memory of 
machine 1 to be emptied in a similar manner as machines 2 and 3 once indexing 
was stopped. Most of the memory belonged to the "CMS Old Gen" memory pool 
(well above 2GB).

Indexing until the memory is full on machine 1 gives a count of 50 000 in 
"File Descriptor Count" - while the number of files in the index folder is 
around 150 for each node. I was told that the number of files in the index 
folder and the file descriptor count should match? Machine 1 has an enormous 
number of TCP connections stalled in CLOSE_WAIT - while machines 2 and 3 don't 
have the corresponding FIN_WAITs, even though machine 1 has almost all of its 
TCP connections pointing at those machines.
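
A way to compare those two numbers directly on a node - a sketch that assumes 
a single Solr JVM started via Jetty's start.jar and uses one of our core paths 
(adjust the path to your install), run as the Solr user or root:

  SOLR_PID=$(pgrep -f start.jar)
  # descriptors currently held by the Solr JVM (the number JConsole shows)
  ls /proc/$SOLR_PID/fd | wc -l
  # files in one core's index directory, for comparison
  ls /opt/solr/server/solr/DIPS_shard1_replica1/data/index | wc -l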

[JConsole screenshots of machine 1 and machine 2, respectively.]
At 08:45 we resumed indexing - the same exception as shown above appeared 
around 08:52. Machine 2 clears most of its memory at GC - in contrast to 
machine 1.


We have no idea whether this is a bug or a fault in the configuration, and 
were hoping someone could help us with this problem.

Greetings,
Mads


RE: Solr node crashes while indexing - Too many open files

2016-06-30 Thread Mads Tomasgård Bjørgan
That's true, but I was hoping there would be another way to solve this issue, 
as that approach isn't preferable in our situation.

Is it normal behavior for Solr to open over 4000 files without closing them 
properly? Is it, for example, possible to adjust the autoCommit settings in 
solrconfig.xml to force Solr to close the files?
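
For reference, the kind of settings I mean - just a sketch with example 
values, not a known fix:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit at most every 15 s: flushes in-memory segments to disk
           and closes their files; does not open a new searcher -->
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- soft commit only controls visibility of new documents -->
      <maxTime>60000</maxTime>
    </autoSoftCommit>
  </updateHandler>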

Any help is appreciated :-)

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: torsdag 30. juni 2016 11.41
To: solr-user@lucene.apache.org
Subject: RE: Solr node crashes while indexing - Too many open files

Mads, some distributions require different steps for increasing max_open_files. 
Check how it works for CentOS specifically.
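
For example - a sketch assuming Solr runs as a "solr" user; note that on 
CentOS 7 a systemd-managed service ignores limits.conf and needs the limit in 
the unit instead:

  # /etc/security/limits.conf (applies to PAM sessions / init scripts)
  solr  soft  nofile  65536
  solr  hard  nofile  65536

  # /etc/systemd/system/solr.service.d/override.conf
  # (then run: systemctl daemon-reload && systemctl restart solr)
  [Service]
  LimitNOFILE=65536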

Markus

 
 
-Original message-
> From: Mads Tomasgård Bjørgan 
> Sent: Thursday 30th June 2016 10:52
> To: solr-user@lucene.apache.org
> Subject: Solr node crashes while indexing - Too many open files
> 
> Hello,
> We're indexing a large set of files using Solr 6.1.0, running a SolrCloud by 
> utilizing ZooKeeper 3.4.8.
> 
> We have two ensembles - and both clusters are running on three of their own 
> respective VMs (CentOS 7). We first thought the error was due to CDCR, as we 
> were trying to index a large number of documents which had to be replicated 
> to the target cluster. However, we got the same error even after turning off 
> CDCR - which indicates CDCR wasn't the problem after all.
> 
> After indexing between 20 000 and 35 000 documents to the source cluster, the 
> File Descriptor Count reaches 4096 for one of the Solr nodes - and that node 
> crashes. The count grows quite linearly over time. The remaining two nodes in 
> the cluster are not affected at all, and their logs contained no relevant 
> entries. We found the following errors for the crashing node in its log:
> 
> 2016-06-30 08:23:12.459 ERROR 
> (updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1
>  x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) 
> [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] 
> o.a.s.u.StreamingSolrClients error
> java.net.SocketException: Too many open files
> (...)
> 2016-06-30 08:23:12.460 ERROR 
> (updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1
>  x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) 
> [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] 
> o.a.s.u.StreamingSolrClients error
> java.net.SocketException: Too many open files
> (...)
> 2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
> x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase 
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  2 Async exceptions during distributed update:
> Too many open files
> Too many open files
> (...)
> 2016-06-30 08:23:12.461 INFO  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
> x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  
> webapp=/solr path=/update params={version=2.2} status=-1 QTime=5
> 2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
> x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall 
> null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  2 Async exceptions during distributed update:
> Too many open files
> Too many open files
> ()
> 
> 2016-06-30 08:23:12.461 WARN  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
> x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1
> 2016-06-30 08:23:38.108 INFO  (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 
> x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  
> webapp=/solr path=/select 
> params={df=_text_=false=id=score=4=0=true=https://10.0.106.115:443/solr/DIPS_shard1_replica1/=10=2=*:*=1467275018057=true=javabin&_=1467275017220}
>  hits=30218 status=0 QTime=1
> 
> Running netstat -n -p on the VM that yields the exceptions reveals that there 
> are at least 1 800 TCP connections waiting to be closed (we didn't count 
> exactly how many - the netstat output filled the entire PuTTY window with 
> some 2 000 lines):
> tcp6  70  0 10.0.106.115:34531  10.0.106.114:443  CLOSE_WAIT  21658/java
> We're running the SolrCloud on port 443, and the IPs belong to the VMs. We 
> also tried raising the ulimit for the machine to 100 000 - without any result.
> 
> Greetings,
> Mads
> 


Solr node crashes while indexing - Too many open files

2016-06-30 Thread Mads Tomasgård Bjørgan
Hello,
We're indexing a large set of files using Solr 6.1.0, running a SolrCloud by 
utilizing ZooKeeper 3.4.8.

We have two ensembles - and both clusters are running on three of their own 
respective VMs (CentOS 7). We first thought the error was due to CDCR, as we 
were trying to index a large number of documents which had to be replicated to 
the target cluster. However, we got the same error even after turning off CDCR 
- which indicates CDCR wasn't the problem after all.

After indexing between 20 000 and 35 000 documents to the source cluster, the 
File Descriptor Count reaches 4096 for one of the Solr nodes - and that node 
crashes. The count grows quite linearly over time. The remaining two nodes in 
the cluster are not affected at all, and their logs contained no relevant 
entries. We found the following errors for the crashing node in its log:

2016-06-30 08:23:12.459 ERROR 
(updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1
 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) 
[c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] 
o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
(...)
2016-06-30 08:23:12.460 ERROR 
(updateExecutor-2-thread-22-processing-https:10.0.106.168:443//solr//DIPS_shard3_replica1
 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) 
[c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] 
o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
(...)
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase 
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 2 Async exceptions during distributed update:
Too many open files
Too many open files
(...)
2016-06-30 08:23:12.461 INFO  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  webapp=/solr 
path=/update params={version=2.2} status=-1 QTime=5
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall 
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 2 Async exceptions during distributed update:
Too many open files
Too many open files
()

2016-06-30 08:23:12.461 WARN  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1
2016-06-30 08:23:38.108 INFO  (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  webapp=/solr 
path=/select 
params={df=_text_=false=id=score=4=0=true=https://10.0.106.115:443/solr/DIPS_shard1_replica1/=10=2=*:*=1467275018057=true=javabin&_=1467275017220}
 hits=30218 status=0 QTime=1

Running netstat -n -p on the VM that yields the exceptions reveals that there 
are at least 1 800 TCP connections waiting to be closed (we didn't count 
exactly how many - the netstat output filled the entire PuTTY window with some 
2 000 lines):
tcp6  70  0 10.0.106.115:34531  10.0.106.114:443  CLOSE_WAIT  21658/java
We're running the SolrCloud on port 443, and the IPs belong to the VMs. We also 
tried raising the ulimit for the machine to 100 000 - without any result.
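
For anyone checking the same thing: the shell's ulimit does not always apply 
to the running process, so it is worth verifying what the Solr JVM actually 
got - a one-liner assuming a single JVM started via Jetty's start.jar:

  grep 'open files' /proc/$(pgrep -f start.jar)/limits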

Greetings,
Mads