I've run into an issue with starting my solr cloud with many collections.

My setup is:

3 nodes (solr 4.10.3; 64GB RAM each; jdk1.8.0_25) running on a single server (256GB RAM).
5,000 collections (1 x shard; 2 x replica) = 10,000 cores
1 x Zookeeper 3.4.6
Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
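
As a quick sanity check of the numbers in this setup, here is a minimal sketch of the arithmetic. The variable names are only illustrative, and the per-node figure assumes the cores are spread evenly across the 3 nodes, which is an assumption rather than something measured.

```python
# Sanity check of the setup figures quoted above.
collections = 5000
shards_per_collection = 1
replicas_per_shard = 2
nodes = 3

# Each replica of each shard is one Solr core.
total_cores = collections * shards_per_collection * replicas_per_shard

# Assumption: cores are distributed evenly over the nodes.
cores_per_node = total_cores / nodes

# -Djute.maxbuffer is given in bytes; convert to MiB for readability.
jute_maxbuffer_bytes = 67108864
jute_maxbuffer_mib = jute_maxbuffer_bytes / (1024 * 1024)

print("total cores:", total_cores)
print("cores per node (even spread assumed):", round(cores_per_node))
print("jute.maxbuffer in MiB:", jute_maxbuffer_mib)
```

So the jute.maxbuffer value corresponds to 64 MiB, and each node would be loading roughly a third of the 10,000 cores at startup.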

Then I stop all nodes, then start all nodes. All replicas are in the down state, and some have no leader. At times I have seen some (12 or so) leaders in the active state.

In the solr logs I see lots of:

org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DDDDDD-4351 after 30 seconds; our state says http://ftea1:8001/solr/DDDDDD-4351_shard1_replica1/, but ZooKeeper says http://ftea1:8000/solr/DDDDDD-4351_shard1_replica2/

org.apache.solr.common.SolrException; :org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:910)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:822)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says: http://ftea1:8001/solr/DDDDDD-1564_shard1_replica2/ but zookeeper says: http://ftea1:8000/solr/DDDDDD-1564_shard1_replica1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:889)
        ... 6 more

I've tried staggering the starts (1 min), but that does not help.
I've reproduced this with zero documents.
Restarts are OK up to around 3,000 cores.

Should this work?

Damien.