[ https://issues.apache.org/jira/browse/SOLR-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749310#action_12749310 ]
Martijn van Groningen commented on SOLR-1143: --------------------------------------------- I think we need to bird's-eye view at the partial results solution, so that we can hook in the partial results behaviour at the right place. This is quiet a long comment, but first I will describe how I think that distributed search works and then propose a solution. In think that this solutions is better than the current one in the patch. >From my understanding the distributed search in the trunk currently works as >follows: 1) When it has been determined that the a search request is a multi shard request an instance of HttpCommComponent is created and outgoing and finished lists are initialised. Also the nextStage is set to zero. 2) The ResponseBuilder's stage is set to the nextStage and the nextStage is set to stage done. The distributedProcess(...) method is invoked on each search component. Each search component can add ShardRequests to the outgoing list in the ResponseBuilder. Besides adding ShardRequests, a search component also returns a stage. The lowest stage from all search components will end up to be the next stage. {code:java} // call all components for( SearchComponent c : components ) { // the next stage is the minimum of what all components report nextStage = Math.min(nextStage, c.distributedProcess(rb)); } {code} 3a) Next step is to send all the ShardRequests from ResponseBuilder's output list to the shards. First a ShardRequest is taken and removed from the ResponseBuilder's output list, then the actual shards are determined for the current ShardRequest. {code:java}ShardRequest sreq = rb.outgoing.remove(0);{code} It checks if for the current overall search request the shards are specified and than use. If that is not the case the predefined shards become the actual shards. {code:java} sreq.actualShards = sreq.shards; if (sreq.actualShards==ShardRequest.ALL_SHARDS) { sreq.actualShards = rb.shards; } {code} 3b)Now that the actual shards are known, a request can be sent to each individual shard. The actual sending of the request is done by the HttpCommComponent.submit(...) method. Before the request is sent, a new SolrParams is constructed based on the overall search request parameters. But with some parameters removed and some parameters added. Then the SolrParams is given to HttpCommComponent.submit(...) method as a argument and is used to create a QueryRequest. In the HttpCommComponent.submit(...) a Callable is instantiated to handle sending request to a shard and receiving a response in an asynchronised manner. In the takes's call() method the actual request (QueryRequest) is created, that will be send to a shard. Also in this method the response is received and if an exception occurred, it is set on the shard response. The callable is then submitted to the completionService's submit method. The submit methods returns a Future that is then added to a set of futures named pending.From my understanding this pending list of futures is only used to keep track of how many request were send and to cancel a request when an exception occurred. 4) When the request are sent for a stage, the next step is to receive the response for each shard request that has been sent. The comm.takeCompletedOrError() returns a shard response. It first checks if an exception was set on the response, if so the search is aborted and the exception is re-thrown. If all went well, then the request of the shard response is added to a list of successful request named finished. After that, the SearchComponent's handleResponses(...) method is invoked that allows the search components to inspect the shard response and perhaps do something with it. The behaviour is repeated until comm.takeCompletedOrError() returns null, which means that all response for the current stage were retrieved. The comm.takeCompletedOrError() handles each response from the shards individually (sub ShardRequest). It uses the completionService's take() method that get a future and uses that to remove that same instance for the pending set. Then the method get is invoked on the future and the response is returned. If the response contains an exception then the response is immediately returned. When the response does not contain a exception it is added to the responses of the ShardRequest. When the number of responses in the ShardRequest is equal to the number of shards then the last response from the get() method of Future is returned (it contains the ShardRequest that contains all the responses). 5) When all request were sent and response were received, on each search component the finishStage(...) method is executed. This allows components to execute some custom logic that is only possible if all shard requests are collected. When that is done it checks if the current stage is not equal to stage done. It then continues with step 2 till 5, until the stage finish is the current stage. That indicates that the distributed search is finished and the response can be written to the client. I think the best way to handle shard failures in my opinion is by not sending a request to a shard that has failed. I think the best way to implement that is by doing the following: 1) Currently ShardRequest has a property actualShards that is a string array of shard host names. Let say we create a Shard data type that contains a string hostname and a boolean failed as properties. The actualShards property will be changed to this Shard data type. 2) In phase 4 when we discover that a ShardRequest failed we need to mark a shard as failed. Therefore the take() or takeCompletedOrError() need store the shard hostname with the exception. In the handleRequestBody we then check if one or more exceptions / hostnames were set, if so we mark those hostnames in ShardRequest as failed. 3) In phase 3b we only invoke HttpCommComponent.submit(...) on the shards that are not marked as failed. Something like this: {code:java} for (Shard shard : sreq.actualShards) { if (shard.hasFailed()) { continue; } ModifiableSolrParams params = new ModifiableSolrParams(sreq.params); params.remove(ShardParams.SHARDS); // not a top-level request params.remove("indent"); params.remove(CommonParams.HEADER_ECHO_PARAMS); params.set(ShardParams.IS_SHARD, true); // a sub (shard) request String shardHandler = req.getParams().get(ShardParams.SHARDS_QT); if (shardHandler == null) { params.remove(CommonParams.QT); else { params.set(CommonParams.QT, shardHandler); } comm.submit(sreq, shard.getHostname(), params); } {code} I think that this approach is much more efficient than the current approach, because no request is sent to the failed shard and thus HttpClient does not try to make a connection to a shard that would not response properly anyway. I think implementing this solution is not that much work. What are your thoughts about this approach? > Return partial results when a connection to a shard is refused > -------------------------------------------------------------- > > Key: SOLR-1143 > URL: https://issues.apache.org/jira/browse/SOLR-1143 > Project: Solr > Issue Type: Improvement > Components: search > Reporter: Nicolas Dessaigne > Assignee: Grant Ingersoll > Fix For: 1.4 > > Attachments: SOLR-1143-2.patch, SOLR-1143-3.patch, SOLR-1143.patch > > > If any shard is down in a distributed search, a ConnectException it thrown. > Here's a little patch that change this behaviour: if we can't connect to a > shard (ConnectException), we get partial results from the active shards. As > for TimeOut parameter (https://issues.apache.org/jira/browse/SOLR-502), we > set the parameter "partialResults" at true. > This patch also adresses a problem expressed in the mailing list about a year > ago > (http://www.nabble.com/partialResults,-distributed-search---SOLR-502-td19002610.html) > We have a use case that needs this behaviour and we would like to know your > thougths about such a behaviour? Should it be the default behaviour for > distributed search? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.