Re: Solr recovery issue in 7.5
I'm still pretty clueless trying to find the root cause of this behavior. One thing is pretty consistent: whenever a node restarts and sends a recovery command, the recipient shard/replica goes down due to a sudden surge in old gen heap space. Within minutes it hits the ceiling and stalls the server, and this keeps us going in circles.

After moving to 7.5, we decided to switch from CMS to G1. We are using the recommended settings from Shawn's blog:

GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"

Can this be tuned better to avoid this? Also, I'm curious to know if any 7.5 user has experienced a similar scenario. Could there be some major change related to recovery that I might be missing after porting from 6.6?

-- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
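For what it's worth, one hypothetical variant people sometimes experiment with in this situation is lowering the IHOP so concurrent marking starts before old gen fills up, and dropping -XX:+AggressiveOpts (deprecated on later JDKs). This is a sketch for experimentation only, not a verified fix for the recovery problem:

```shell
# Hypothetical variant of the posted GC_TUNE -- for experimentation only.
# IHOP=60 starts G1's concurrent marking earlier, which can help when
# old gen fills in a sudden surge; AggressiveOpts is dropped since it is
# deprecated on newer JDKs. All other flags are as posted.
GC_TUNE="-XX:+UseG1GC \
  -XX:+PerfDisableSharedMem \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=250 \
  -XX:InitiatingHeapOccupancyPercent=60 \
  -XX:+UseLargePages \
  -XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"
```

Whether this helps depends on why old gen is surging in the first place, so a heap dump is still the more direct diagnostic.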
Re: Solr recovery issue in 7.5
Well, if your indexing is very light, then the commit interval isn't a big deal; a false start on my part. The key here is how long a replay of the tlog would take if it had to be replayed. Probably not an issue, though, if you have minimal update rates.

On Fri, Dec 14, 2018 at 12:48 AM shamik wrote:
>
> Thanks Erick. I guess I was not clear when I mentioned that I had stopped the
> indexing process. It was just a temporary step to make sure that we are not
> adding any new data when the nodes are in recovery mode. The 10 minute
> hard commit is carried over from our 6.5 configuration, which actually
> followed the guiding principle of "Index light - Query light/heavy" from the
> same document you mentioned. I would surely try a 15 sec hard
> commit, openSearcher=false, and 10 min softcommit and see if it makes a
> difference. I'm working on getting a heap dump to see if it provides any
> red flag.
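One rough way to gauge the tlog-replay exposure Erick mentions is to look at the size of each core's tlog directory before restarting; the directory used below is a stand-in created for illustration, not an actual Solr path from this thread:

```shell
# Sketch: size up a tlog directory to gauge how long replay might take.
# A real core's tlogs live under its data/tlog dir; this demo dir and
# file are stand-ins so the commands have something to report.
TLOG_DIR=/tmp/demo_core/data/tlog
mkdir -p "$TLOG_DIR"
dd if=/dev/zero of="$TLOG_DIR/tlog.0000000000000000042" bs=1024 count=64 2>/dev/null
# Large or numerous tlog files mean more uncommitted updates to replay
# on startup -- which is why short hard commits keep recovery cheap.
du -sk "$TLOG_DIR"
ls "$TLOG_DIR" | wc -l
```

With a 10 minute hard commit, up to 10 minutes of updates can sit in these files waiting to be replayed.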
Re: Solr recovery issue in 7.5
Thanks Erick. I guess I was not clear when I mentioned that I had stopped the indexing process. It was just a temporary step to make sure that we are not adding any new data while the nodes are in recovery mode. The 10 minute hard commit is carried over from our 6.5 configuration, which actually followed the guiding principle of "Index light - Query light/heavy" from the same document you mentioned. I would surely try a 15 sec hard commit, openSearcher=false, and 10 min softcommit and see if it makes a difference. I'm working on getting a heap dump to see if it provides any red flag.
Re: Solr recovery issue in 7.5
Yeah, this is weird. First of all, recovery shouldn't take much memory even if it's a full recovery; it's just file copies. And there shouldn't be any problem with the leadership changing, that should _not_ require a replication.

Your autocommit max time is a bit worrisome. You have docs hanging around in the tlog for 10 minutes; potentially they must be replayed at startup, which could get weird. You say that you stop indexing, but do you wait for at least 10 minutes afterwards? Hard commits (autocommit) with openSearcher=false are actually a pretty inexpensive operation. See: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/. I'd recommend making these much shorter, say 15 seconds. That's a shot in the dark as a factor in your situation though.

Next up would be a heap dump to see where the memory spike comes from, 'cause I'm out of ideas...

Best,
Erick

On Wed, Dec 12, 2018 at 8:20 PM shamik wrote:
>
> Erick,
>
> Thanks for your input. All our fields (for facet, group & sort) have
> docvalues enabled since 6.5. That includes the id field. Here's the field
> cache entry:
>
> CACHE.core.fieldCache.entries_count: 0
> CACHE.core.fieldCache.total_size: 0 bytes
>
> Based on whatever I've seen so far, I think zookeeper is not the culprit
> here. All the nodes, including zookeeper, were set up recently. They are all
> inside the same VPC within the same AZ. The instances talk to each other
> through a dedicated network. Both zookeeper and Solr instances have SSDs.
>
> Here's what's happening based on my observation. Whenever an instance is
> restarted, it initiates a preprecovery command to its leader or a
> different node in the other shard. The node which receives the recovery
> request is the one which goes down next. Within a few minutes, the
> heap size (old gen) reaches the max allocated heap, thus stalling the
> process.
I guess due to this, it fails to send the credentials for a > zookeeper session within the stipulated timeframe, which is why zookeeper > terminates the session. Here's from the startup log. > > 2018-12-13 04:02:34.910 INFO > (recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr > x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9) > [c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4] > o.a.s.c.RecoveryStrategy Sending prep recovery command to > [http://x.x.240.225:8983/solr]; [WaitForState: > action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true] > > The node sends a recovery command to its replica which immediately triggers > G1 Old Gen jvm pool to reach the max heap size. Please see the screenshot > below which shows the sudden jump in heap size. We've made sure that the > indexing process is completely switched off at this point, so there's no > commit happening. > > JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0 > > I'm totally puzzled by this weird behavior, never seen something like this > before. Could G1GC settings be contributing to this issue? 
> > From zookeeper log: > > 2018-12-13 03:47:27,905 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - > Accepted socket connection from /10.0.0.160:58376 > 2018-12-13 03:47:27,905 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - > Accepted socket connection from /10.0.0.160:58378 > 2018-12-13 03:47:27,905 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client > attempting to establish new session at /10.0.0.160:58376 > 2018-12-13 03:47:27,905 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x0, likely client has closed > socket > 2018-12-13 03:47:27,905 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed > socket connection for client /10.0.0.160:58378 (no session established for > client) > 2018-12-13 03:47:27,907 [myid:1] - INFO > [CommitProcessor:1:ZooKeeperServer@683] - Established session > 0x100c46d01440072 with negotiated timeout 1 for client /10.0.0.160:58376 > 2018-12-13 03:47:39,386 [myid:1] - INFO > [CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client > /10.0.0.160:58376 which had sessionid 0x100c46d01440072 > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
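Since the startup script in this thread already sets commit intervals via system properties, Erick's 15-second hard commit / 10-minute soft commit suggestion could be tried the same way. This is a sketch; it assumes solrconfig.xml wires autoCommit/autoSoftCommit to these properties, as the stock configsets do:

```shell
# Sketch of the suggested commit intervals, expressed in the same
# solr.in.sh style used elsewhere in this thread. Assumes solrconfig.xml
# reads solr.autoCommit.maxTime / solr.autoSoftCommit.maxTime, as the
# default configs do.
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=15000"      # 15 s hard commit
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=600000" # 10 min soft commit
# openSearcher=false for hard commits is set in solrconfig.xml itself,
# not via a property here.
```

The short hard commit truncates the tlog frequently (cheap, since no searcher is opened), while the long soft commit keeps the document-visibility behavior of the original 10 minute setting.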
Re: Solr recovery issue in 7.5
Erick,

Thanks for your input. All our fields (for facet, group & sort) have docvalues enabled since 6.5. That includes the id field. Here's the field cache entry:

CACHE.core.fieldCache.entries_count: 0
CACHE.core.fieldCache.total_size: 0 bytes

Based on whatever I've seen so far, I think zookeeper is not the culprit here. All the nodes, including zookeeper, were set up recently. They are all inside the same VPC within the same AZ. The instances talk to each other through a dedicated network. Both zookeeper and Solr instances have SSDs.

Here's what's happening based on my observation. Whenever an instance is restarted, it initiates a preprecovery command to its leader or a different node in the other shard. The node which receives the recovery request is the one which goes down next. Within a few minutes, the heap size (old gen) reaches the max allocated heap, thus stalling the process. I guess due to this, it fails to send the credentials for a zookeeper session within the stipulated timeframe, which is why zookeeper terminates the session. Here's from the startup log:

2018-12-13 04:02:34.910 INFO (recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9) [c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://x.x.240.225:8983/solr]; [WaitForState: action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]

The node sends a recovery command to its replica, which immediately triggers the G1 Old Gen JVM pool to reach the max heap size. Please see the screenshot below, which shows the sudden jump in heap size. We've made sure that the indexing process is completely switched off at this point, so there's no commit happening.
JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0

I'm totally puzzled by this weird behavior; never seen something like this before. Could the G1GC settings be contributing to this issue?

From the zookeeper log:

2018-12-13 03:47:27,905 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /10.0.0.160:58378
2018-12-13 03:47:27,905 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client attempting to establish new session at /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2018-12-13 03:47:27,905 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /10.0.0.160:58378 (no session established for client)
2018-12-13 03:47:27,907 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@683] - Established session 0x100c46d01440072 with negotiated timeout 1 for client /10.0.0.160:58376
2018-12-13 03:47:39,386 [myid:1] - INFO [CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client /10.0.0.160:58376 which had sessionid 0x100c46d01440072
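To see whether the G1 settings are in play, the pause times in the unified GC log (the startup script uses -Xlog:gc*) are worth pulling out. This is a sketch with made-up sample lines; in practice GC_LOG would point at the real solr_gc.log:

```shell
# Sketch: extract G1 pause durations from a JDK 9+ unified GC log.
# The two sample lines below are fabricated for illustration; point
# GC_LOG at the actual solr_gc.log instead.
GC_LOG=/tmp/sample_gc.log
cat > "$GC_LOG" <<'EOF'
[2018-12-13T04:02:30.123+0000][12.345s] GC(42) Pause Young (Normal) (G1 Evacuation Pause) 8192M->4096M(30720M) 212.437ms
[2018-12-13T04:02:35.456+0000][17.678s] GC(43) Pause Full (G1 Compaction Pause) 30500M->30400M(30720M) 5123.901ms
EOF
# Full pauses, or Young pauses far above MaxGCPauseMillis=250, during
# recovery point at heap exhaustion rather than tuning noise.
grep -E 'Pause (Young|Full)' "$GC_LOG" | awk '{print $NF}'
```

A cluster of multi-second Full pauses right when the prep recovery command arrives would line up with the old-gen surge in the JVM pool screenshot.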
Re: Solr recovery issue in 7.5
Whenever I see "memory consumption changed", my first question is: are any fields that sort, group or facet set with docValues="false"? I consider this unlikely, since one of the recent changes was to default to "true" for primitive types, but it's worth checking. The easiest way to check (and this does not have to be on a node that's having problems; any node that's been serving queries for a while will do) is to go into the admin UI>>select_a_core>>plugins/stats>>cache>>fieldCache. Are there any entries there?

This part of your logs is very suspicious:

2018-12-12 19:57:13.730 WARN (recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7) [c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1] o.a.s.c.ZkController Unable to read /collections/knowledge/leader_initiated_recovery/shard1/core_node7 due to: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/knowledge/leader_initiated_recovery/shard1/core_node7

Solr 7.5 uses ZK as part of knowing whether shards are up to date (there was a presentation at Activate by Dat Cao Manh about the details), so if your ZK connection is timing out, that may be the underlying cause. With a 60 second timeout, though, it would be odd for this to be just a timeout issue, so that's puzzling. Have you done anything odd with ZooKeeper, like replacing nodes or the like? ZK had an issue at one point where resolving the DNS names for reconfigured ZK nodes didn't work well. If you've restarted all your instances (zk first, of course), that wouldn't be relevant.

Just a few random thoughts...

Best,
Erick

On Wed, Dec 12, 2018 at 2:09 PM Shamik Bandopadhyay wrote:
>
> Hi,
>
> We recently upgraded Solr from 6.5 to 7.5. We are observing some weird
> issues in terms of recovery and memory usage. We've a cluster of 6 physical
> nodes with 2 shards having two replicas each.
> 7.5 seemed to have a higher memory consumption, where the average heap
> utilization hovers around 18 gb. A couple of days back, one of the replicas
> went down as the heap (30 gb) was exhausted. Upon restart, the instance came
> back quickly, but then started a spiral effect where the nodes in the
> cluster kept going down one after the other. So at any point of time, there
> were 5 instances available instead of 6. Every time we would bring the bad
> instance back up, it would be functional immediately, but the shard it was
> recovering from would eventually (within minutes) go down. This cycle (of
> restarting instances) went on for almost an hour before all the nodes were
> finally active. It occurred again today, where we are observing similar
> behavior. We even stopped the indexing pipeline to make sure that recovery
> is minimal. But it didn't make any difference; the error is consistent: the
> affected node goes into recovery (not sure why) and encounters a session
> timeout in the process.
>
> 2018-12-12 19:59:16.813 ERROR
> (recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
> x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
> [c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
> o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=knowledge_shard1_replica_n1:org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /overseer/queue/qn- > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:786) > at > org.apache.solr.common.cloud.SolrZkClient.lambda$create$7(SolrZkClient.java:398) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60) > at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:398) > at > org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:321) > at org.apache.solr.cloud.ZkController.publish(ZkController.java:1548) > at org.apache.solr.cloud.ZkController.publish(ZkController.java:1436) > at > org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:549) > at > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:310) > at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:294) > at > com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) > at java.base/
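As an alternative to clicking through the admin UI for the fieldCache check Erick describes, the same numbers can be pulled from the Metrics API. This is a sketch; the host and port are assumed defaults, not values taken from this thread:

```shell
# Sketch only: query one node's Metrics API for fieldCache stats.
# localhost:8983 is an illustrative default, not the poster's address.
curl -s "http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE.core.fieldCache"
# A non-zero entries_count means some field is still being uninverted
# into the on-heap FieldCache instead of using docValues.
```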
Solr recovery issue in 7.5
Hi,

We recently upgraded Solr from 6.5 to 7.5. We are observing some weird issues in terms of recovery and memory usage. We have a cluster of 6 physical nodes with 2 shards having two replicas each. 7.5 seemed to have a higher memory consumption, where the average heap utilization hovers around 18 gb. A couple of days back, one of the replicas went down as the heap (30 gb) was exhausted. Upon restart, the instance came back quickly, but then started a spiral effect where the nodes in the cluster kept going down one after the other. So at any point of time, there were 5 instances available instead of 6. Every time we would bring the bad instance back up, it would be functional immediately, but the shard it was recovering from would eventually (within minutes) go down. This cycle (of restarting instances) went on for almost an hour before all the nodes were finally active. It occurred again today, where we are observing similar behavior. We even stopped the indexing pipeline to make sure that recovery is minimal. But it didn't make any difference; the error is consistent: the affected node goes into recovery (not sure why) and encounters a session timeout in the process.

2018-12-12 19:59:16.813 ERROR (recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7) [c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1] o.a.s.c.RecoveryStrategy Error while trying to recover.
core=knowledge_shard1_replica_n1:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:786)
at org.apache.solr.common.cloud.SolrZkClient.lambda$create$7(SolrZkClient.java:398)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:398)
at org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:321)
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1548)
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1436)
at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:549)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:310)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:294)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.base/java.lang.Thread.run(Thread.java:844)

I've added the dropbox links to the relevant error from solr log, solr gc log, top screenshot and heap usage from spm monitor.
Solr log --> https://www.dropbox.com/s/als26je0dgrp10r/solr.log?dl=0
Solr gc log 1 --> https://www.dropbox.com/s/m0ikb3kc9enme4f/solr_gc.log.0?dl=0
Solr gc log 2 --> https://www.dropbox.com/s/jfs7wcjyult5ud8/solr_gc.log.1?dl=0
Top --> https://www.dropbox.com/s/x6f0zwqfbabybd0/top.png?dl=0

SPM monitor screenshots:
JVM pool --> https://www.dropbox.com/s/nbko83eflp8y2tp/jvm_pool.png?dl=0
GC screenshot --> https://www.dropbox.com/s/6zofkvgfknxwjgd/gc.png?dl=0
Solr Cache --> https://www.dropbox.com/s/o6zsxwal6pzspve/cache.png?dl=0

Here are the relevant entries from the solr startup script:

SOLR_JAVA_MEM="-Xms40g -Xmx40g"
GC_LOG_OPTS='-Xlog:gc*'
GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"
ZK_CLIENT_TIMEOUT="6"
SOLR_WAIT_FOR_ZK="180"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=12"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60"
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x20"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"

All our nodes are running on AWS, having 16 vCPU and 128gb. We started with a 30gb heap as the average utilization was between 18-19 gb. For this recovery issue, we tried bumping up to 40gb, but it didn't make any difference. We are using jdk 9.0.4+11 with 6 Solr nodes and a 3-zookeeper (3.4.10) quorum. Our index has close to 10 million documents with a 55gb index size. Not sure if it's relevant, but we have noticed that filter cache utilization has drastically reduced (0.17) while document cache