Re: Solr recovery issue in 7.5

2018-12-17 Thread shamik
I'm still pretty clueless trying to find the root cause of this behavior. One
thing is pretty consistent: whenever a node starts up and sends a recovery
command, the recipient shard/replica goes down due to a sudden surge in
old-gen heap space. Within minutes it hits the ceiling and stalls the server,
and this keeps us going in circles. After moving to 7.5, we decided to switch
from CMS to G1. We are using the recommended settings from Shawn's blog.

GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"

Can this be tuned better to avoid this?
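For what it's worth, here's one variation I've been toying with (just a sketch,
not something we've tested). It drops the experimental -XX:+AggressiveOpts and
-XX:+UseLargePages flags and adds an automatic heap dump on OOM; the
HeapDumpPath below is an assumption on my part, adjust as needed.

# sketch only: drops AggressiveOpts/UseLargePages, adds a heap dump on OOM
GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+AlwaysPreTouch \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/mnt/ebs2/solrhome/logs \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"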

Also, I'm curious to know if any 7.5 user has experienced a similar scenario.
Is there some major change related to recovery that I might be missing after
porting from 6.6?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr recovery issue in 7.5

2018-12-14 Thread Erick Erickson
Well, if your indexing is very light, then the commit interval isn't a
big deal. A false start on my part.

The key here is how long a replay of the tlog would take if it had to
be replayed. Probably not an issue, though, if you have minimal update rates.
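If you want a rough sense of that, the size of the tlog directory on disk is a
decent proxy for how much there is to replay. Something like this (the path is
a guess based on your solrhome and core names, adjust to your actual layout):

# total size of each core's transaction log
du -sh /mnt/ebs2/solrhome/knowledge_shard*_replica_n*/data/tlog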


On Fri, Dec 14, 2018 at 12:48 AM shamik  wrote:
>
> Thanks Erick. I guess I wasn't clear when I mentioned that I had stopped the
> indexing process. It was just a temporary step to make sure that we are not
> adding any new data while the nodes are in recovery mode. The 10 minute
> hard commit is carried over from our 6.5 configuration, which followed the
> guiding principle of "Index light - Query light/heavy" from the same document
> you mentioned. I will try a 15 sec hard commit with openSearcher=false and a
> 10 min soft commit and see if it makes a difference. I'm working on getting a
> heap dump to see if it turns up any red flags.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr recovery issue in 7.5

2018-12-14 Thread shamik
Thanks Erick. I guess I wasn't clear when I mentioned that I had stopped the
indexing process. It was just a temporary step to make sure that we are not
adding any new data while the nodes are in recovery mode. The 10 minute
hard commit is carried over from our 6.5 configuration, which followed the
guiding principle of "Index light - Query light/heavy" from the same document
you mentioned. I will try a 15 sec hard commit with openSearcher=false and a
10 min soft commit and see if it makes a difference. I'm working on getting a
heap dump to see if it turns up any red flags.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr recovery issue in 7.5

2018-12-13 Thread Erick Erickson
Yeah, this is weird. First of all, recovery shouldn't take much memory
even if it's a full recovery; it's just file copies. And there
shouldn't be any problem with the leadership changing, that should
_not_ require a replication.

Your autocommit max time is a bit worrisome. You have docs hanging
around in the tlog for 10 minutes; potentially they must be replayed
at startup, which could get weird. You say that you stop indexing, but
do you wait for at least 10 minutes afterwards? Hard commits
(autocommit) with openSearcher=false are actually pretty inexpensive
operations. See:
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/.
I'd recommend making these much shorter, say 15 seconds. That's a shot
in the dark as a factor in your situation, though.
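For example, using the same system properties you already set in solr.in.sh
(just a sketch; 15000 ms is the 15 second suggestion above, and openSearcher=false
is assumed to come from the autoCommit block in your solrconfig.xml):

# 15 second hard commit; leave openSearcher=false in solrconfig.xml
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=15000"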

Next up would be a heap dump to see where the memory spike comes from,
'cause I'm out of ideas...
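If it helps, something along these lines captures a dump of live objects
(a sketch; the pid lookup and output path are just placeholders):

# dump only live (reachable) objects from the running Solr JVM
jmap -dump:live,format=b,file=/tmp/solr-heap.hprof $(pgrep -f start.jar)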

Best,
Erick
On Wed, Dec 12, 2018 at 8:20 PM shamik  wrote:
>
> Erick,
>
>Thanks for your input. All our fields (for facet, group & sort) have
> docvalues enabled since 6.5. That includes the id field. Here's the field
> cache entry:
>
> CACHE.core.fieldCache.entries_count:0
> CACHE.core.fieldCache.total_size:  0 bytes
>
> Based on whatever I've seen so far, I think ZooKeeper is not the culprit
> here. All the nodes, including ZooKeeper, were set up recently. They are all
> inside the same VPC, within the same AZ. The instances talk to each other
> through a dedicated network. Both the ZooKeeper and Solr instances have SSDs.
>
> Here's what's happening based on my observation. Whenever an instance is
> restarted, it initiates a PREPRECOVERY command to its leader or a
> different node in the other shard. The node that receives the recovery
> request is the one that goes down next. Within a few minutes, the
> heap size (old gen) reaches the max allocated heap, stalling the
> process. I guess that, because of this, it fails to send the credentials for a
> ZooKeeper session within the stipulated timeframe, which is why ZooKeeper
> terminates the session. Here's an excerpt from the startup log.
>
> 2018-12-13 04:02:34.910 INFO
> (recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr
> x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9)
> [c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4]
> o.a.s.c.RecoveryStrategy Sending prep recovery command to
> [http://x.x.240.225:8983/solr]; [WaitForState:
> action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
>
> The node sends a recovery command to its replica, which immediately triggers
> the G1 Old Gen JVM pool to reach the max heap size. Please see the screenshot
> below, which shows the sudden jump in heap size. We've made sure that the
> indexing process is completely switched off at this point, so there's no
> commit happening.
>
> JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0
>
> I'm totally puzzled by this weird behavior; I've never seen anything like it
> before. Could the G1GC settings be contributing to this issue?
>
> From the ZooKeeper log:
>
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
> Accepted socket connection from /10.0.0.160:58376
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
> Accepted socket connection from /10.0.0.160:58378
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client
> attempting to establish new session at /10.0.0.160:58376
> 2018-12-13 03:47:27,905 [myid:1] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
> read additional data from client sessionid 0x0, likely client has closed
> socket
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed
> socket connection for client /10.0.0.160:58378 (no session established for
> client)
> 2018-12-13 03:47:27,907 [myid:1] - INFO
> [CommitProcessor:1:ZooKeeperServer@683] - Established session
> 0x100c46d01440072 with negotiated timeout 1 for client /10.0.0.160:58376
> 2018-12-13 03:47:39,386 [myid:1] - INFO
> [CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client
> /10.0.0.160:58376 which had sessionid 0x100c46d01440072
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr recovery issue in 7.5

2018-12-12 Thread shamik
Erick,

   Thanks for your input. All our fields (for facet, group & sort) have
docvalues enabled since 6.5. That includes the id field. Here's the field
cache entry:

CACHE.core.fieldCache.entries_count:0
CACHE.core.fieldCache.total_size:  0 bytes

Based on whatever I've seen so far, I think ZooKeeper is not the culprit
here. All the nodes, including ZooKeeper, were set up recently. They are all
inside the same VPC, within the same AZ. The instances talk to each other
through a dedicated network. Both the ZooKeeper and Solr instances have SSDs.

Here's what's happening based on my observation. Whenever an instance is
restarted, it initiates a PREPRECOVERY command to its leader or a
different node in the other shard. The node that receives the recovery
request is the one that goes down next. Within a few minutes, the
heap size (old gen) reaches the max allocated heap, stalling the
process. I guess that, because of this, it fails to send the credentials for a
ZooKeeper session within the stipulated timeframe, which is why ZooKeeper
terminates the session. Here's an excerpt from the startup log.

2018-12-13 04:02:34.910 INFO 
(recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr
x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9)
[c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4]
o.a.s.c.RecoveryStrategy Sending prep recovery command to
[http://x.x.240.225:8983/solr]; [WaitForState:
action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]

The node sends a recovery command to its replica, which immediately triggers
the G1 Old Gen JVM pool to reach the max heap size. Please see the screenshot
below, which shows the sudden jump in heap size. We've made sure that the
indexing process is completely switched off at this point, so there's no
commit happening.

JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0

I'm totally puzzled by this weird behavior; I've never seen anything like it
before. Could the G1GC settings be contributing to this issue?

From the ZooKeeper log:

2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
Accepted socket connection from /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
Accepted socket connection from /10.0.0.160:58378
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client
attempting to establish new session at /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x0, likely client has closed
socket
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed
socket connection for client /10.0.0.160:58378 (no session established for
client)
2018-12-13 03:47:27,907 [myid:1] - INFO 
[CommitProcessor:1:ZooKeeperServer@683] - Established session
0x100c46d01440072 with negotiated timeout 1 for client /10.0.0.160:58376
2018-12-13 03:47:39,386 [myid:1] - INFO 
[CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client
/10.0.0.160:58376 which had sessionid 0x100c46d01440072



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr recovery issue in 7.5

2018-12-12 Thread Erick Erickson
Whenever I see "memory consumption changed", my first question is: are
any fields that sort, group or facet set with docValues="false"? I
consider this unlikely, since one of the recent changes was to
default to "true" for primitive types, but it's worth checking. The
easiest way to check (and this does not have to be on a node that's
having problems; any node that's been serving queries for a
while will do) is to go into the admin
UI>>select_a_core>>plugins/stats>>cache>>fieldCache. Are there any
entries there?
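If the command line is easier, the metrics API should show the same thing
(host and port are placeholders here):

curl "http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE.core.fieldCache"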

This part of your logs is very suspicious:

2018-12-12 19:57:13.730 WARN
(recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
[c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
o.a.s.c.ZkController Unable to read
/collections/knowledge/leader_initiated_recovery/shard1/core_node7 due
to: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/collections/knowledge/leader_initiated_recovery/shard1/core_node7

Solr 7.5 uses ZK as part of knowing whether shards are up to date;
there was a presentation at Activate by Dat Cao Manh about the
details. But I think if your ZK connection is timing out, that may be
the underlying cause. With a 60 second timeout it would be odd for this
to be just a timeout issue, so that's puzzling.

Have you done anything odd with ZooKeeper? Like replacing nodes or the
like? ZK had an issue at one point where resolving the DNS names for
reconfigured ZK nodes didn't work well. If you've restarted all your
instances (ZK first, of course), that wouldn't be relevant.
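While you're at it, a quick sanity check against each ZK node doesn't hurt;
"stat" and "mntr" are standard four-letter commands (hostnames are placeholders):

echo stat | nc zk1.example.com 2181   # connection count, and whether the node is leader/follower
echo mntr | nc zk1.example.com 2181   # latency and outstanding-request counters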

Just a few random thoughts...

Best,
Erick
On Wed, Dec 12, 2018 at 2:09 PM Shamik Bandopadhyay  wrote:
>
> Hi,
>
>   We recently upgraded Solr from 6.5 to 7.5. We are observing some weird
> issues in terms of recovery and memory usage. We have a cluster of 6 physical
> nodes with 2 shards having two replicas each. 7.5 seems to have higher
> memory consumption, with average heap utilization hovering around 18 GB.
> A couple of days back, one of the replicas went down as the heap (30 GB) was
> exhausted. Upon restart, the instance came back quickly, but then a spiral
> effect started where the nodes in the cluster kept going down one
> after the other. So at any point of time, there were 5 instances available
> instead of 6. Every time we brought the bad instance back up, it would
> be functional immediately, but the shard it was recovering from would
> eventually (within minutes) go down. This cycle (of restarting instances)
> went on for almost an hour before all the nodes were finally active.
> It occurred again today, and we are observing similar behavior. We even
> stopped the indexing pipeline to make sure that recovery is minimal. But it
> didn't make any difference; the error is consistent, the affected node goes
> into recovery (not sure why) and encounters a session timeout in the
> process.
>
> 2018-12-12 19:59:16.813 ERROR
> (recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
> x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
> [c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
> o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=knowledge_shard1_replica_n1:org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer/queue/qn-
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:786)
>  at
> org.apache.solr.common.cloud.SolrZkClient.lambda$create$7(SolrZkClient.java:398)
>  at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>  at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:398)
>  at
> org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:321)
>  at org.apache.solr.cloud.ZkController.publish(ZkController.java:1548)
>  at org.apache.solr.cloud.ZkController.publish(ZkController.java:1436)
>  at
> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:549)
>  at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:310)
>  at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:294)
>  at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>  at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
>  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>  at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>  at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
>  at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
>  at java.base/

Solr recovery issue in 7.5

2018-12-12 Thread Shamik Bandopadhyay
Hi,

  We recently upgraded Solr from 6.5 to 7.5. We are observing some weird
issues in terms of recovery and memory usage. We have a cluster of 6 physical
nodes with 2 shards having two replicas each. 7.5 seems to have higher
memory consumption, with average heap utilization hovering around 18 GB.
A couple of days back, one of the replicas went down as the heap (30 GB) was
exhausted. Upon restart, the instance came back quickly, but then a spiral
effect started where the nodes in the cluster kept going down one
after the other. So at any point of time, there were 5 instances available
instead of 6. Every time we brought the bad instance back up, it would
be functional immediately, but the shard it was recovering from would
eventually (within minutes) go down. This cycle (of restarting instances)
went on for almost an hour before all the nodes were finally active.
It occurred again today, and we are observing similar behavior. We even
stopped the indexing pipeline to make sure that recovery is minimal. But it
didn't make any difference; the error is consistent, the affected node goes
into recovery (not sure why) and encounters a session timeout in the
process.

2018-12-12 19:59:16.813 ERROR
(recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
[c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
o.a.s.c.RecoveryStrategy Error while trying to recover.
core=knowledge_shard1_replica_n1:org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue/qn-
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
 at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:786)
 at
org.apache.solr.common.cloud.SolrZkClient.lambda$create$7(SolrZkClient.java:398)
 at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
 at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:398)
 at
org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:321)
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1548)
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1436)
 at
org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:549)
 at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:310)
 at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:294)
 at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
 at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
 at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
 at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
 at java.base/java.lang.Thread.run(Thread.java:844)

I've added the dropbox links to the relevant error from solr log, solr gc
log, top screenshot and heap usage from spm monitor.

Solr log --> https://www.dropbox.com/s/als26je0dgrp10r/solr.log?dl=0

Solr gc log 1 -->
https://www.dropbox.com/s/m0ikb3kc9enme4f/solr_gc.log.0?dl=0

Solr gc log 2 -->
https://www.dropbox.com/s/jfs7wcjyult5ud8/solr_gc.log.1?dl=0

Top --> https://www.dropbox.com/s/x6f0zwqfbabybd0/top.png?dl=0

SPM monitor screenshots:

JVM pool --> https://www.dropbox.com/s/nbko83eflp8y2tp/jvm_pool.png?dl=0

GC screenshot --> https://www.dropbox.com/s/6zofkvgfknxwjgd/gc.png?dl=0

Solr Cache --> https://www.dropbox.com/s/o6zsxwal6pzspve/cache.png?dl=0

Here are the relevant entries from solr startup script:

SOLR_JAVA_MEM="-Xms40g -Xmx40g"

GC_LOG_OPTS='-Xlog:gc*'

GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"

ZK_CLIENT_TIMEOUT="6"

SOLR_WAIT_FOR_ZK="180"

SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=12"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60"
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x20"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"

All our nodes are running on AWS with 16 vCPUs and 128 GB of RAM. We started
with a 30 GB heap as the average utilization was between 18-19 GB. For this
recovery issue, we tried bumping it up to 40 GB, but that didn't make any
difference. We are using JDK 9.0.4+11 with 6 Solr nodes and a 3-node ZooKeeper
(3.4.10) quorum.

Our index has close to 10 million documents with a 55 GB index size. Not sure
if it's relevant, but we have noticed that filter cache utilization has
drastically reduced (0.17) while document cache