Suril Shah created SOLR-13532:
---------------------------------

             Summary: Unable to start core recovery due to timeout in ping request
                 Key: SOLR-13532
                 URL: https://issues.apache.org/jira/browse/SOLR-13532
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 7.6
            Reporter: Suril Shah


We discovered the following issue with core recovery:
 * Core recovery is not initiated, and the following message is logged:
{code:java}
2019-06-07 00:53:12.436 INFO  (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again
{code}

 * The above message is logged when the ping request to the leader takes longer than the timeout, which is hard-coded to one second (1000 ms) in the Solr source code. In a production setting it is common for a ping to take longer than one second, so every attempt times out, the failure is logged again, and core recovery never starts.
 * The other major concern is that this failure is logged at INFO level, so it is very difficult to identify when INFO logging is not enabled.
 * Please refer to the following code snippet from the [source code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] to understand the issue.

{code:java}
      try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
          .withSocketTimeout(1000)
          .withConnectionTimeout(1000)
          .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
          .build()) {
        SolrPingResponse resp = httpSolrClient.ping();
        return leaderReplica;
      } catch (IOException e) {
        log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
        Thread.sleep(500);
      } catch (Exception e) {
        if (e.getCause() instanceof IOException) {
          log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
          Thread.sleep(500);
        }
{code}
This issue can have a high impact on production clusters, since cores that are unable to recover may lead to data loss.
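
To illustrate why retrying does not help here, the following is a minimal, self-contained model of the retry behaviour. It is not Solr code: the class {{PingRetryModel}}, its methods, and the {{maxAttempts}} bound are hypothetical names introduced only for this sketch, and the "ping" is simulated by comparing a steady leader latency against the timeout rather than by making an HTTP call.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Illustrative model only; none of these names exist in Solr. */
final class PingRetryModel {

    private PingRetryModel() {}

    /** Stand-in for a leader ping: fails when the leader is slower than the timeout. */
    static void ping(long leaderLatencyMs, long timeoutMs) throws IOException {
        if (leaderLatencyMs > timeoutMs) {
            throw new SocketTimeoutException("Read timed out after " + timeoutMs + " ms");
        }
    }

    /**
     * Retries the simulated ping with a fixed timeout.
     * Returns the 1-based attempt that succeeded, or -1 if every attempt timed out.
     * maxAttempts is an assumption of this sketch, used so that the loop terminates.
     */
    static int firstSuccessfulAttempt(long leaderLatencyMs, long timeoutMs, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ping(leaderLatencyMs, timeoutMs);
                return attempt;
            } catch (IOException e) {
                // With a steady latency above the timeout, the next attempt
                // fails in exactly the same way, so retrying makes no progress.
            }
        }
        return -1;
    }
}
```

In this model, a leader that consistently needs 1500 ms to answer never passes a 1000 ms timeout regardless of the number of retries, whereas raising the timeout to 2000 ms succeeds on the first attempt.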

The following improvements would be really helpful:
 1. The [timeout for the ping request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] in *RecoveryStrategy.java* should be configurable.
 2. The messages at [line 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] and [line 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] in *RecoveryStrategy.java* should be logged at *error* level instead of *info* level.
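
As a rough sketch of what the first improvement could look like, the helper below resolves the ping timeout from a property and falls back to the current hard-coded value of 1000 ms. This is only one possible approach, not an existing Solr feature: the class {{RecoveryPingTimeout}} and the property name {{solr.recovery.leaderPingTimeoutMs}} are hypothetical and were chosen purely for illustration.

```java
import java.util.Properties;

/** Hypothetical helper, not part of Solr. */
final class RecoveryPingTimeout {

    /** Matches the value currently hard-coded in RecoveryStrategy.java. */
    static final int DEFAULT_TIMEOUT_MS = 1000;

    /** Hypothetical property name, chosen for illustration only. */
    static final String PROPERTY = "solr.recovery.leaderPingTimeoutMs";

    private RecoveryPingTimeout() {}

    /**
     * Returns the configured timeout in milliseconds, or the default when the
     * property is missing, blank, not a number, or not strictly positive.
     */
    static int resolve(Properties props) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_TIMEOUT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_TIMEOUT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT_MS;
        }
    }
}
```

Under these assumptions, the value returned by {{RecoveryPingTimeout.resolve(System.getProperties())}} could be passed to {{withSocketTimeout(...)}} and {{withConnectionTimeout(...)}} in place of the literal {{1000}}, which keeps the existing behaviour when nothing is configured.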


