Suril Shah created SOLR-13532:
---------------------------------

Summary: Unable to start core recovery due to timeout in ping request
Key: SOLR-13532
URL: https://issues.apache.org/jira/browse/SOLR-13532
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SolrCloud
Affects Versions: 7.6
Reporter: Suril Shah

Discovered the following issue with core recovery:

* Core recovery is not initiated, and the following message is logged:
{code:java}
2019-06-07 00:53:12.436 INFO (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again
{code}
* The error occurs when the ping request to the leader takes longer than the timeout, which is hard-coded to one second in the Solr source code. In a typical production setting, ping times above one second are common, so core recovery never starts and the exception is thrown.
* The other major concern is that this exception is logged as an info message, so the failure is very difficult to identify if info logging is not enabled.
* Please refer to the following snippet from the [source code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] to understand the issue:
{code:java}
try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
    .withSocketTimeout(1000)
    .withConnectionTimeout(1000)
    .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
    .build()) {
  SolrPingResponse resp = httpSolrClient.ping();
  return leaderReplica;
} catch (IOException e) {
  log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
  Thread.sleep(500);
} catch (Exception e) {
  if (e.getCause() instanceof IOException) {
    log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
    Thread.sleep(500);
  }
{code}

This issue has a high impact on production clusters, since cores that cannot recover may lead to data loss.

The following improvements would be really helpful:

1. The [timeout for ping request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] in *RecoveryStrategy.java* should be configurable.
2. The messages at [line 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] and [line 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] in *RecoveryStrategy.java* should be logged as *error* messages instead of *info* messages.
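
As a rough illustration of the first improvement, the ping timeout could be resolved from a setting that falls back to the current one-second value. This is only a sketch under stated assumptions: the class name `RecoveryPingTimeout` and the property name `solr.recovery.pingTimeoutMs` are hypothetical and are not existing Solr settings or APIs.

```java
// Hypothetical helper, not part of Solr: resolves the recovery ping timeout
// from a JVM system property, falling back to the current hard-coded value.
final class RecoveryPingTimeout {
  // Assumed property name, chosen only for this illustration.
  static final String PROPERTY = "solr.recovery.pingTimeoutMs";
  // Matches the 1000 ms value currently hard-coded in RecoveryStrategy.
  static final int DEFAULT_MS = 1000;

  private RecoveryPingTimeout() {}

  // Reads the timeout from the system property.
  static int fromSystemProperty() {
    return parse(System.getProperty(PROPERTY));
  }

  // Returns the parsed timeout in milliseconds, or DEFAULT_MS when the
  // value is missing, blank, not a number, or not strictly positive.
  static int parse(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_MS;
    }
    try {
      int value = Integer.parseInt(raw.trim());
      return value > 0 ? value : DEFAULT_MS;
    } catch (NumberFormatException e) {
      return DEFAULT_MS;
    }
  }
}
```

With such a helper, the value could be passed to `.withSocketTimeout(...)` and `.withConnectionTimeout(...)` in place of the literal `1000`. Falling back to the default on invalid input keeps the existing behaviour for clusters that do not set the property.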