[ 
https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gerlowski reassigned SOLR-13038:
--------------------------------------

    Assignee:     (was: Jason Gerlowski)

I hope to revisit this soon, but don't have time to focus on it in the 
immediate future.  So I'm removing myself as the assignee.

I still think this is an important issue to fix though, since it continues to 
contribute to test flakiness and affects production behavior as well.

> Overseer actions fail with NoHttpResponseException following a node restart
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-13038
>                 URL: https://issues.apache.org/jira/browse/SOLR-13038
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Priority: Major
>         Attachments: SOLR-13038.patch
>
>
> I noticed recently that a lot of overseer operations fail if they're executed 
> right after a restart of a Solr node.  The failure returns a message like 
> "org.apache.solr.client.solrj.SolrServerException:IOException occured when 
> talking to server at: https://127.0.0.1:62253/solr".  The logs are a bit more 
> helpful:
> {code}
> org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://127.0.0.1:62253/solr
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657) ~[java/:?]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[java/:?]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[java/:?]
>     at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[java/:?]
>     at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172) ~[java/:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_172]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.6.jar:3.2.6]
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[java/:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
> Caused by: org.apache.http.NoHttpResponseException: 127.0.0.1:62253 failed to respond
>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120) ~[java/:?]
>     at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[java/:?]
>     ... 12 more
> {code}
> After a bit of debugging I was able to confirm the problem: when some 
> non-overseer node gets restarted, the overseer never notices that its 
> connections are invalid and will try to reuse them for subsequent requests 
> that happen right after the restart.
> There are a few ways we might be able to tackle this:
> * we could look at adding logic to {{SolrHttpRequestRetryHandler}} to retry 
> when this happens.  SHRRH already retries NoHttpResponseException generally, 
> but has other logic which prevents any retries on collection/core-admin APIs. 
> Maybe we could refine that logic a bit.
> * we could add retry logic to the {{HttpShardHandler}} code that makes these 
> requests.  We could do this across the board, or more selectively for only 
> the overseer commands that are "retry-able".
> * we could tweak how our connection pool is managed so that it evicts these 
> idle connections more aggressively.  It seems like something similar has 
> already been tried (without success) on SOLR-6944.
> Not sure what the right approach is.  Intermittent 
> NoHttpResponseExceptions seem to have been a problem in Solr (and its tests) 
> going back at least 5 years or so.  Several JIRAs suggested adding retries for 
> NHRE in the past, but were killed since not all APIs are idempotent, and other 
> JIRAs have been concerned with fixing this at the (very broad) SolrClient 
> level.
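
To make the "selective retry" idea concrete, here is a minimal sketch of retrying only requests that the caller has marked as safe to repeat. It is not taken from the attached patch or from Solr's code. Everything in it is hypothetical: `StaleConnectionException` is a stand-in for `org.apache.http.NoHttpResponseException` so that the example compiles without HttpClient on the classpath, and `SelectiveRetry`, its `call` method, and the `idempotent` flag are illustrative names only.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/**
 * Hypothetical stand-in for org.apache.http.NoHttpResponseException,
 * used so this sketch has no dependency on HttpClient.
 */
class StaleConnectionException extends IOException {
    StaleConnectionException(String message) {
        super(message);
    }
}

/** Illustrative helper; not part of Solr. */
final class SelectiveRetry {
    private SelectiveRetry() {}

    /**
     * Runs the request and retries it when the connection turned out to be
     * stale, but only if the caller declared the request idempotent.
     *
     * @param request    the operation to run (assumed to pick a connection per call)
     * @param idempotent whether repeating the request is safe
     * @param maxRetries how many extra attempts are allowed after the first one
     */
    static <T> T call(Callable<T> request, boolean idempotent, int maxRetries) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return request.call();
            } catch (StaleConnectionException e) {
                // Non-idempotent requests are never repeated: the server may
                // already have acted on the first attempt.
                if (!idempotent || retries >= maxRetries) {
                    throw e;
                }
                retries++;
            }
        }
    }
}
```

A caller would wrap an individual overseer request, for example `SelectiveRetry.call(() -> sendRequest(), true, 1)`, where `sendRequest` is again a placeholder.  The open question from the description remains: someone still has to decide which overseer commands are actually retry-able.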


