[ https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Grimstad reassigned IGNITE-9141: --------------------------------------- Assignee: Sergey Grimstad > SQL: Trace and test query mapping problems > ------------------------------------------ > > Key: IGNITE-9141 > URL: https://issues.apache.org/jira/browse/IGNITE-9141 > Project: Ignite > Issue Type: Task > Components: sql > Affects Versions: 2.6 > Reporter: Vladimir Ozerov > Assignee: Sergey Grimstad > Priority: Major > Fix For: 2.7 > > > One of mandatory steps of SQL query execution is topology mapping - we need > to select nodes where required caches are located, and make sure that their > partition distribution is valid for the given SQL query. Once nodes are > detected, we try to reserve partitions of interest on mapper nodes to make > sure that they will not be evicted during query execution. > However, mapping step may fail for many reasons. Most often this is rebalance > or concurrent node failures. In this case we simply retry the whole query > execution from scratch. In IGNITE-9114 we ensured that retry cycle is not > infinite and that root cause of remap is logged. However, original root cause > of remap is not propagated to client node making the problem hard to debug > for end users. Also we do not have enough tests for remap events. Let's fix > this. > Proposed implementation flow: > 1) Add {{retryCause: String}} field to {{GridQueryNextPageResponse}} which > should be populated along with {{retry}} field on mapper node. See > {{GridMapQueryExecutor#sendRetry}} method to understand what may cause > retries (failed to reserve partitions or failed to execute non-collocated > join). Make sure that these error messages are as verbose as possible with > all necessary details (root cause, cache names, affected partitions, etc). > 2) Make sure that root cause is set in {{ReduceQueryRun#state}} and then > propagated to user exception in case of retry timeout. > 3) Evaluate all places inside > {{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}} > which may lead to re-try and make sure that root cause is verbose and > propagated to user exception in case of retry timeout. > 4) Add tests covering all re-try branches and ensure that query fails after > timeout and that error message is correct. > *NB*: Once propagation of error message to reducer is implemented, we may > remove additional logging altogether. -- This message was sent by Atlassian JIRA (v7.6.3#76005)