Re: SolrCloud - "KeeperErrorCode = NoNode" - after restart
I don't know that I've ever seen anyone test so many cores with SolrCloud. Perhaps there is a timeout that is too low, or ... Can you file a JIRA issue? I can do some tests.

On Fri, Dec 20, 2013 at 11:22 AM, Bojan Šmid wrote:
> Any ideas about possible cause? And shouldn't Solr maybe try to recover
> from such situation?

--
- Mark
Re: SolrCloud - "KeeperErrorCode = NoNode" - after restart
Maybe https://issues.apache.org/jira/browse/SOLR-5569 will help?

A few related issues:
https://issues.apache.org/jira/browse/SOLR-5568
https://issues.apache.org/jira/browse/SOLR-5552

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
SolrCloud - "KeeperErrorCode = NoNode" - after restart
Hi,

I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, with around 2000 collections (each with single shard, each shard having 1 or 2 replicas), running on Tomcat. Each Solr node hosts around 1000 physical cores.

When starting any node, I almost always see errors like:

2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR org.apache.solr.cloud.ZkController- Error getting leader from zk
org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:945)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:909)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:873)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:807)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:757)
    at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:272)
    at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:489)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:264)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:261)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)

It happens just for some cores, usually for about 10-20 of them out of 1000 on one node (each time different cores fail). These 10-20 cores are then marked as "down" and they are never "recovered", while other cores work ok.

I did check ZK, there really is no node "/collections/core_20131120/leaders/shard1", but "/collections/core_20131120/leaders" exists, so it looks like "shard1" was removed (maybe during previous shutdown?).

Also, when I stop all nodes and clear ZK state, and after that start Solr (rolling starting nodes one by one), all nodes start properly and all cores are properly loaded ("active"). But after that, first restart of any Solr node causes issues on that node.

Any ideas about possible cause? And shouldn't Solr maybe try to recover from such situation?

Thanks,

Bojan
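
Since different cores fail on each restart, it may help to list exactly which leader znodes were reported missing after every startup and compare the lists between runs. Below is a minimal sketch in Python that pulls the collection and shard out of the "KeeperErrorCode = NoNode for ..." message shown in the trace above. The function name `missing_leader_nodes` and the regular expression are only illustrative, and the sketch assumes the message sits on a single line in the real log (it is wrapped here only by the mail client).

```python
import re
from collections import Counter

# Matches the ZooKeeper message from the trace, for example:
# "KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1"
# The pattern is an assumption based on that single sample line.
NO_NODE = re.compile(
    r"KeeperErrorCode = NoNode for "
    r"/collections/(?P<collection>[^/\s]+)/leaders/(?P<shard>[^/\s]+)"
)


def missing_leader_nodes(log_lines):
    """Count (collection, shard) pairs whose leader znode was reported missing."""
    hits = Counter()
    for line in log_lines:
        for match in NO_NODE.finditer(line):
            hits[(match.group("collection"), match.group("shard"))] += 1
    return hits


# Two lines taken from the log excerpt above.
sample = [
    "2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR "
    "org.apache.solr.cloud.ZkController- Error getting leader from zk",
    "Caused by: org.apache.zookeeper.KeeperException$NoNodeException: "
    "KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1",
]

result = missing_leader_nodes(sample)
for (collection, shard), count in sorted(result.items()):
    print(collection, shard, count)
```

To run it against a real log, pass an open file object instead of `sample`; any iterable of lines works. Saving the output after each restart and diffing the lists would show whether the same collections ever fail twice.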