[ 
https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Grimstad reassigned IGNITE-9141:
---------------------------------------

    Assignee: Sergey Grimstad

> SQL: Trace and test query mapping problems
> ------------------------------------------
>
>                 Key: IGNITE-9141
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9141
>             Project: Ignite
>          Issue Type: Task
>          Components: sql
>    Affects Versions: 2.6
>            Reporter: Vladimir Ozerov
>            Assignee: Sergey Grimstad
>            Priority: Major
>             Fix For: 2.7
>
>
> One of the mandatory steps of SQL query execution is topology mapping: we need
> to select the nodes where the required caches are located and make sure that
> their partition distribution is valid for the given SQL query. Once the nodes
> are selected, we try to reserve the partitions of interest on the mapper nodes
> to make sure that they will not be evicted during query execution.
> However, the mapping step may fail for many reasons, most often because of
> rebalance or concurrent node failures. In this case we simply retry the whole
> query execution from scratch. In IGNITE-9114 we ensured that the retry cycle
> is not infinite and that the root cause of the remap is logged. However, the
> original root cause of the remap is not propagated to the client node, which
> makes the problem hard to debug for end users. We also do not have enough
> tests for remap events. Let's fix this.
> Proposed implementation flow:
> 1) Add a {{retryCause: String}} field to {{GridQueryNextPageResponse}}, to be
> populated along with the {{retry}} field on the mapper node. See the
> {{GridMapQueryExecutor#sendRetry}} method to understand what may cause
> retries (failure to reserve partitions or failure to execute a non-collocated
> join). Make sure that these error messages are as verbose as possible and
> include all necessary details (root cause, cache names, affected partitions,
> etc.).
> 2) Make sure that the root cause is set in {{ReduceQueryRun#state}} and then
> propagated to the user exception in case of a retry timeout.
> 3) Evaluate all places inside
> {{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}}
> that may lead to a retry, and make sure that the root cause is verbose and
> propagated to the user exception in case of a retry timeout.
> 4) Add tests covering all retry branches, ensuring that the query fails after
> the timeout and that the error message is correct.
> *NB*: Once propagation of the error message to the reducer is implemented, we
> may remove the additional logging altogether.
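
The flow proposed in the issue (mapper reports a retry together with its cause, reducer keeps the latest cause and surfaces it once the retry timeout expires) can be illustrated with a minimal, self-contained Java sketch. It is not Ignite code: PageResponse, QueryRetryException, RetrySketch and runWithRetry are hypothetical names invented for this illustration, and the real GridQueryNextPageResponse and GridReduceQueryExecutor have different shapes. Only the two fields named in the issue (retry and the proposed retryCause) are modelled.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the mapper's reply; models only retry and retryCause. */
final class PageResponse {
    final boolean retry;
    final String retryCause; // proposed field; null when retry is false

    PageResponse(boolean retry, String retryCause) {
        this.retry = retry;
        this.retryCause = retryCause;
    }

    static PageResponse ok() {
        return new PageResponse(false, null);
    }

    static PageResponse retry(String cause) {
        return new PageResponse(true, cause);
    }
}

/** Hypothetical user-facing exception that carries the last remap cause. */
final class QueryRetryException extends RuntimeException {
    QueryRetryException(String msg) {
        super(msg);
    }
}

final class RetrySketch {
    /**
     * Repeats the mapping attempt until it succeeds or the timeout elapses.
     * At least one attempt is always made.
     *
     * @return number of attempts made when mapping succeeds
     * @throws QueryRetryException if every attempt within the timeout asked for a retry
     */
    static int runWithRetry(Supplier<PageResponse> mapAttempt, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        String lastCause = null;
        int attempts = 0;

        do {
            attempts++;

            PageResponse res = mapAttempt.get();

            if (!res.retry)
                return attempts;

            // Remember the most recent cause so it can reach the user on timeout.
            lastCause = res.retryCause;
        } while (System.nanoTime() - deadline < 0);

        throw new QueryRetryException("Failed to map SQL query to topology after "
            + attempts + " attempt(s) within " + timeoutMs
            + " ms [lastRetryCause=" + lastCause + ']');
    }
}
```

A test in the spirit of step 4 would pass a supplier that always returns PageResponse.retry(...) with a short timeout and then assert that the thrown exception message contains the mapper-side cause (cache names, affected partitions, and so on).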



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
