[ https://issues.apache.org/jira/browse/SOLR-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-10720:
-----------------------------------------
    Attachment: SOLR-10720.patch

> Aggressive removal of a collection breaks cluster state
> -------------------------------------------------------
>
>                 Key: SOLR-10720
>                 URL: https://issues.apache.org/jira/browse/SOLR-10720
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 6.5.1
>            Reporter: Alexey Serba
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>         Attachments: SOLR-10720.patch
>
>
> We are periodically seeing a tricky concurrency bug in SolrCloud that starts with a `Could not fully remove collection: my_collection` exception:
> {noformat}
> 2017-05-17T14:47:50,153 - ERROR [OverseerThreadFactory-6-thread-5:SolrException@159] - {} - Collection: my_collection operation: delete failed:org.apache.solr.common.SolrException: Could not fully remove collection: my_collection
>     at org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:106)
>     at org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:224)
>     at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:463)
> {noformat}
> After that, all SolrCloud operations that involve reading cluster state fail with
> {noformat}
> org.apache.solr.common.SolrException: Error loading config name for collection my_collection
>     at org.apache.solr.common.cloud.ZkStateReader.readConfigName(ZkStateReader.java:198)
>     at org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:141)
>     ...
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/my_collection
>     ...
> {noformat}
> See full [stacktraces|https://gist.github.com/serba/9b7932f005f34f6cd9a511e226c6f0c6]
> As a result SolrCloud becomes completely broken.
> We are seeing this with 6.5.1, but I think we've seen it with older versions too.
> From looking into the code, it looks like a combination of two factors:
> * Forcefully removing the collection's znode in the finally block in [DeleteCollectionCmd|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L115], which was introduced in SOLR-5135. Note that this causes the cached cluster state to be out of sync with the state in Zk, i.e. {{zkStateReader.getClusterState()}} still has the collection in it (see the code [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L98]) whereas the {{/collections/<collection_id>}} znode in Zk is already removed.
> * The operation that reads cluster state not only returns the cached version, but also reads each collection's config name from the {{/collections/<collection_id>}} znode, which in this case was forcefully removed. The code to read the config name for every collection directly from Zk was introduced in SOLR-7636. Aren't there performance implications of reading N znodes (1 per collection) on every {{getClusterStatus}} call?
> I'm not sure what the proper fix should be:
> * Should we just catch {{KeeperException$NoNodeException}} in {{getClusterStatus}} and treat such a collection as removed? That looks like the easiest / least invasive fix.
> * Should we stop reading the config name from the collection znode and get it from the cache somehow?
> * Should we not try to delete the collection's data from Zk if the delete operation failed?
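
To make the first proposed fix concrete, here is a minimal, self-contained sketch of the "catch the missing znode and treat the collection as removed" idea. It is not Solr code and not the attached patch: {{ClusterStatusSketch}}, {{ConfigNameSource}}, {{collectStatus}} and the local {{NoNodeException}} class are hypothetical stand-ins for {{ClusterStatus.getClusterStatus}}, the ZooKeeper read and {{KeeperException$NoNodeException}}, assumed here only so the example compiles on its own.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClusterStatusSketch {

    /** Hypothetical stand-in for KeeperException$NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical stand-in for reading a collection's config name from its znode. */
    interface ConfigNameSource {
        String readConfigName(String collection) throws NoNodeException;
    }

    /**
     * Builds a collection -> config name map from the cached collection list.
     * A collection whose znode no longer exists is skipped (treated as
     * removed) instead of failing the whole status request.
     */
    static Map<String, String> collectStatus(List<String> cachedCollections,
                                             ConfigNameSource zk) {
        Map<String, String> status = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            try {
                status.put(collection, zk.readConfigName(collection));
            } catch (NoNodeException e) {
                // The cache still lists the collection, but its znode is gone:
                // assume a delete raced with this read and leave it out.
            }
        }
        return status;
    }
}
```

With a cached list of {{my_collection}} and one surviving collection, and a source that throws for {{my_collection}}, the result contains only the surviving collection. This only addresses the read side; the stale cache entry and the per-collection Zk reads raised in the other two questions are left untouched.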