Vamsee Yarlagadda created SOLR-7198: ---------------------------------------
Summary: Deleting a collection during leader election results in left over znodes in ZK Key: SOLR-7198 URL: https://issues.apache.org/jira/browse/SOLR-7198 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.10.3 Reporter: Vamsee Yarlagadda I am seeing this issue while trying to run a collection delete operation in the middle of leader election process. Contents of ZK (after issuing collection delete and waiting for some time) {code} / /aliases.json /clusterstate.json /collections SolrUpgrade_collection leaders shard2_1 /configs /live_nodes /overseer /overseer_elect /solr /solr.xml Contents of znode shard2_1: version0 aversion0 children_count0 ctimeThu Mar 05 22:05:28 PST 2015 (1425621928169) cversion0 czxid22815 ephemeralOwner93427899815755800 mtimeThu Mar 05 22:05:28 PST 2015 (1425621928169) mzxid22815 pzxid22815 dataLength194 { "core":"SolrUpgrade_collection_shard2_1_replica2", "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr" } {code} Clusterstate.json before running a delete collection {code} { "shards":{ "shard1":{ "range":"80000000-ffffffff", "state":"active", "replicas":{ "core_node1":{ "state":"active", "core":"SolrUpgrade_collection_shard1_replica2", "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr", "leader":"true"}, "core_node2":{ "state":"active", "core":"SolrUpgrade_collection_shard1_replica1", "node_name":"search-testing-c5-ha-2.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-2.vpc.cloudera.com:8983/solr"}}}, "shard2_0":{ "range":"0-3fffffff", "state":"active", "replicas":{ "core_node5":{ "state":"active", "core":"SolrUpgrade_collection_shard2_0_replica1", "node_name":"search-testing-c5-ha-3.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-3.vpc.cloudera.com:8983/solr", "leader":"true"}, "core_node7":{ "state":"active", "core":"SolrUpgrade_collection_shard2_0_replica2", "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr"}}}, "shard2_1":{ "range":"40000000-7fffffff", "state":"active", "replicas":{ "core_node8":{ "state":"active", "core":"SolrUpgrade_collection_shard2_1_replica2", "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"}, "core_node9":{ "state":"active", "core":"SolrUpgrade_collection_shard2_1_replica3", "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr", "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"}}}}, "maxShardsPerNode":"10", "router":"compositeId", "replicationFactor":"2", "autoAddReplicas":"false", "routerSpec":{"name":"compositeId"}}} {code} As we can notice, shard (*shard2_1*) doesn't have any leader and i can see from the logs that the replicas of the shard just started the leader election process. And here are the Solr logs from one of the above replicas which eventually becomes the leader and registers in ZK even though the collection was deleted. {code} 2015-03-05 22:05:25,383 INFO org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process for shard shard2_1 2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: Checking if I (core=SolrUpgrade_collection_shard2_1_replica2,coreNodeName=core_node8) should try and be the leader. 2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: My last published State was Active, it's okay to be the leader. 2015-03-05 22:05:25,387 INFO org.apache.solr.cloud.ShardLeaderElectionContext: I may be the new leader - try and sync 2015-03-05 22:05:25,506 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4) 2015-03-05 22:05:25,620 INFO org.apache.solr.core.SolrCore.Request: [SolrUpgrade_collection_shard2_1_replica2] webapp=/solr path=/select params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=1 2015-03-05 22:05:25,623 INFO org.apache.solr.core.SolrCore.Request: [SolrUpgrade_collection_shard2_1_replica3] webapp=/solr path=/select params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=0 2015-03-05 22:05:27,392 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node8-n_0000000001 2015-03-05 22:05:27,393 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica3] CLOSING SolrCore org.apache.solr.core.SolrCore@559b52d3 2015-03-05 22:05:27,393 INFO org.apache.solr.update.UpdateHandler: closing DirectUpdateHandler2{commits=0,autocommit maxTime=60000ms,autocommits=0,soft autocommit maxTime=1000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0} 2015-03-05 22:05:27,394 INFO org.apache.solr.update.SolrCoreState: Closing SolrCoreState 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: SolrCoreState ref count has reached 0 - closing IndexWriter 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: closing IndexWriter with IndexWriterCloser 2015-03-05 22:05:27,397 INFO org.apache.solr.servlet.SolrDispatchFilter: [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica2&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2} status=0 QTime=5 2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process for shard shard2_1 2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002 2015-03-05 22:05:27,400 WARN org.apache.solr.cloud.LeaderElector: org.apache.solr.common.SolrException: SolrCore not found:SolrUpgrade_collection_shard2_1_replica3 in [] at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:211) at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163) at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125) at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55) at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358) at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-03-05 22:05:27,415 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica3] Closing main searcher on request. 2015-03-05 22:05:27,438 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4) 2015-03-05 22:05:27,456 INFO org.apache.solr.core.CachingDirectoryFactory: Closing HdfsDirectoryFactory - 2 directories currently being tracked 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index;done=false>>] 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data;done=false>>] 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data 2015-03-05 22:05:27,471 INFO org.apache.solr.core.CachingDirectoryFactory: Removing directory after core close: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data 2015-03-05 22:05:27,490 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002 2015-03-05 22:05:27,491 WARN org.apache.solr.cloud.ElectionContext: cancelElection did not find election node to remove /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002 2015-03-05 22:05:27,493 INFO org.apache.solr.servlet.SolrDispatchFilter: [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica3&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2} status=0 QTime=100 2015-03-05 22:05:27,741 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4) 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Sync replicas to http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/ 2015-03-05 22:05:27,888 ERROR org.apache.solr.cloud.SyncStrategy: Sync Failed:org.apache.solr.common.cloud.ZooKeeperException: Could not find collection in zk: SolrUpgrade_collection [] at org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:600) at org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:588) at org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:583) at org.apache.solr.cloud.SyncStrategy.syncWithReplicas(SyncStrategy.java:152) at org.apache.solr.cloud.SyncStrategy.syncReplicas(SyncStrategy.java:122) at org.apache.solr.cloud.SyncStrategy.sync(SyncStrategy.java:101) at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:267) at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163) at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125) at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55) at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358) at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.ShardLeaderElectionContext: We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.ShardLeaderElectionContext: I am the new leader: http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/ shard2_1 2015-03-05 22:05:27,888 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica2] CLOSING SolrCore org.apache.solr.core.SolrCore@595167 2015-03-05 22:05:27,889 INFO org.apache.solr.update.UpdateHandler: closing DirectUpdateHandler2{commits=1,autocommit maxTime=60000ms,autocommits=0,soft autocommit maxTime=1000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0} 2015-03-05 22:05:27,890 INFO org.apache.solr.update.SolrCoreState: Closing SolrCoreState 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: SolrCoreState ref count has reached 0 - closing IndexWriter 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: closing IndexWriter with IndexWriterCloser 2015-03-05 22:05:27,905 INFO org.apache.solr.core.SolrCore: [SolrUpgrade_collection_shard2_1_replica2] Closing main searcher on request. 2015-03-05 22:05:27,911 INFO org.apache.solr.core.CachingDirectoryFactory: Closing HdfsDirectoryFactory - 2 directories currently being tracked 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index;done=false>>] 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index 2015-03-05 22:05:27,924 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data;done=false>>] 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data 2015-03-05 22:05:27,925 INFO org.apache.solr.store.hdfs.HdfsDirectory: Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data 2015-03-05 22:05:28,026 INFO org.apache.solr.core.CachingDirectoryFactory: Removing directory after core close: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data 2015-03-05 22:05:28,155 INFO org.apache.solr.common.cloud.SolrZkClient: makePath: /collections/SolrUpgrade_collection/leaders/shard2_1 2015-03-05 22:05:28,287 INFO org.apache.solr.common.cloud.ZkStateReader: A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4) {code} This might be because of a missing check at line https://github.com/apache/lucene-solr/blob/lucene_solr_4_10/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java#L138 As the makePath creates all the missing directories recursively. May be we can check whether the collection znode exists before we create the leader znode? Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org