Re: SolrCloud - "KeeperErrorCode = NoNode" - after restart

2013-12-22 Thread Mark Miller
I don't know that I've ever seen anyone test so many cores with SolrCloud.
Perhaps there is a timeout that is too low, or ...
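
If you want to poke at the timeout theory in the meantime: zkClientTimeout in
solr.xml is the usual knob on the Solr side. Below is a minimal probe, nothing
official, using the plain ZooKeeper client with a placeholder host list and a
deliberately generous 60s session timeout, just to see whether a longer
session changes anything:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkSessionProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address - substitute your 5-node ZK quorum.
        String zkHost = "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181";
        final CountDownLatch connected = new CountDownLatch(1);
        // 60s session timeout, deliberately generous for the experiment.
        ZooKeeper zk = new ZooKeeper(zkHost, 60000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        System.out.println("connected, session 0x"
            + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}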

Can you file a JIRA issue? I can do some tests.


On Fri, Dec 20, 2013 at 11:22 AM, Bojan Šmid wrote:

> Hi,
>
>   I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, with around
> 2000 collections (each with a single shard, and each shard having 1 or 2
> replicas), running on Tomcat. Each Solr node hosts around 1000 physical
> cores.
>
> [stack trace and the rest of the message snipped; see the original post
> below]



-- 
- Mark


Re: SolrCloud - "KeeperErrorCode = NoNode" - after restart

2013-12-22 Thread Otis Gospodnetic
Maybe https://issues.apache.org/jira/browse/SOLR-5569 will help?

A few related issues:
https://issues.apache.org/jira/browse/SOLR-5568
https://issues.apache.org/jira/browse/SOLR-5552

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Dec 20, 2013 at 11:22 AM, Bojan Šmid wrote:

> Hi,
>
>   I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, with around
> 2000 collections (each with a single shard, and each shard having 1 or 2
> replicas), running on Tomcat. Each Solr node hosts around 1000 physical
> cores.
>
> [stack trace and the rest of the message snipped; see the original post
> below]


SolrCloud - "KeeperErrorCode = NoNode" - after restart

2013-12-20 Thread Bojan Šmid
Hi,

  I have a cluster with 5 Solr nodes (4.6 release) and 5 ZKs, with around
2000 collections (each with a single shard, and each shard having 1 or 2
replicas), running on Tomcat. Each Solr node hosts around 1000 physical
cores.

  When starting any node, I almost always see errors like:

2013-12-19 18:45:42,454 [coreLoadExecutor-4-thread-721] ERROR org.apache.solr.cloud.ZkController - Error getting leader from zk
org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:945)
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:909)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:873)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:807)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:757)
    at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:272)
    at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:489)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/core6_20131120/leaders/shard1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:264)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:261)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)

  It happens for only some cores, usually about 10-20 of the ~1000 on a node
(a different set of cores fails each time). These 10-20 cores are then marked
"down" and never recover, while the other cores work fine.

  I checked ZK, and there really is no node
"/collections/core_20131120/leaders/shard1", though
"/collections/core_20131120/leaders" does exist, so it looks like "shard1"
was removed (maybe during the previous shutdown?).
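
For concreteness, the kind of check I mean, as a rough sketch with the plain
ZooKeeper client (the connected zk handle is assumed, paths as above):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class LeaderZnodeCheck {
    // zk: an already-connected ZooKeeper handle.
    static void check(ZooKeeper zk) throws Exception {
        String leaders = "/collections/core_20131120/leaders";
        // The parent znode is there...
        System.out.println(leaders + " -> "
            + (zk.exists(leaders, false) != null));
        // ...but the shard1 child is missing, hence the NoNode error.
        System.out.println(leaders + "/shard1 -> "
            + (zk.exists(leaders + "/shard1", false) != null));
        // Whatever children are left under leaders/.
        List<String> children = zk.getChildren(leaders, false);
        System.out.println("children: " + children);
    }
}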

  Also, when I stop all nodes, clear the ZK state, and then start Solr again
(rolling the nodes up one by one), all nodes start properly and all cores
load correctly ("active"). But after that, the first restart of any Solr
node causes these issues on that node.

  Any ideas about the possible cause? And shouldn't Solr try to recover from
such a situation?
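
By "recover" I mean something like retrying until the leader znode reappears.
A rough illustration of the idea (not Solr's actual code, just the shape of
the retry I have in mind):

import org.apache.zookeeper.KeeperException.NoNodeException;
import org.apache.zookeeper.ZooKeeper;

public class LeaderPropsRetry {
    // Illustration only: retry reading the leader props instead of giving
    // up on the first NoNode, since leader election may still be running.
    static byte[] readLeaderProps(ZooKeeper zk, String path, int attempts)
            throws Exception {
        for (int i = 0; i < attempts; i++) {
            try {
                return zk.getData(path, false, null);
            } catch (NoNodeException e) {
                Thread.sleep(1000L * (i + 1)); // simple linear backoff
            }
        }
        throw new NoNodeException(path); // still missing after all retries
    }
}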

  Thanks,

  Bojan