[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 32: Code-Review+2 Carrying +2. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 32 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 20:10:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry * QueryDrivers and ClientRequestStates are now referenced using a QueryHandle Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Reviewed-on: http://gerrit.cloudera.org:8080/14824 Tested-by: Impala Public Jenkins Reviewed-by: Sahil Takiar --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 27 files changed, 2,700 insertions(+), 696 deletions(-) Approvals: Impala Public Jenkins: Verified Sahil Takiar: Looks good to me, approved
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 32: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 32 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 20:06:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 32: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6082/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 32 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 15:26:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 32: (1 comment) Fixing a few issues that are causing the test failures: (1) Removed a DCHECK (see comment below) (2) Fixed a bug where GetQueryHandle was not passing return_unregistered to GetQueryDriver Ran a bunch of more tests locally, and things look good: * Re-ran the stress tests for tpcds and tpch * Looped the new tests overnight and no failures Did some additional cleanup of code comments, removing unnecessary imports, minor-refactoring, etc. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511 PS26, Line 511: if (!failed_backend_states.empty()) { > Done I had to remove the DCHECK because it caused a crash. Turns out that because of the IsAborted() check above, its possible for this list to be empty. Tests that use the EXEC_SERIALIZE_FRAGMENT_INFO Debug Action trigger the DCHECK. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 32 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 14:47:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 32: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5845/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 32 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 14:44:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#32). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry * QueryDrivers and ClientRequestStates are now referenced using a QueryHandle Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 27 files changed, 2,700 insertions(+), 696 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/32 -- To view, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 31: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5839/ -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 31 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 08:27:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 31: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6077/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 31 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 04:01:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#31). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry * QueryDrivers and ClientRequestStates are now referenced using a QueryHandle Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 27 files changed, 2,700 insertions(+), 696 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/31 -- To view, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 31: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5839/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 31 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 03:09:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 30: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5836/ -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 30 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 15 May 2020 01:22:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 29: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6067/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 29 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 14 May 2020 21:41:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 30: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5836/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 30 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 14 May 2020 20:56:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 30: Code-Review+2 Re-based, carrying +2. I'm going to run the pre-commit job to make sure it passes, but I'm also going run the stress tests again locally to make sure they still pass. I'm also going to do one final read-through of the code to make sure we didn't miss anything. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 30 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 14 May 2020 20:56:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 29: (5 comments) http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58 PS27, Line 58: return *request_state_; > Sure, I mean there are basically two cases for how this might be used: I think I see what your saying now. Yeah, your suggestion makes sense since the caller can always decide if they want to make a copy. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// single query submission. > > I think in that case the error still ahh I see what you are saying now - right, the code path in UpdateBackendExecStatus where AuxErrorInfoPB is present, but no error is returned from the code path. the 'retryable_status' gets dropped after the call to TryQueryRetry. looks like this happens in CancelFromThreadPool and HandleFailedExecRpcs as well. filed IMPALA-9748 to follow up on this http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc@35 PS28, Line 35: Finalize() must > Finalize() Done http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc@150 PS28, Line 150: if (!register_status.ok()) { > You could also move the call to CreateClientRequestState into CreateNewDriv Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177 PS27, Line 1177: in > In that case, a non-const QueryHandle& or a QueryHandle*. Not a big deal th Actually, your original suggestion does work. The fact that QueryHandle is a const doesn't really matter because "->" returns a non-const pointer to ClientRequestState. So I re-factored CloseClientRequestState, UpdateExecSummary, and ArchiveQuery to all take in a const QueryHandle&. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 29 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 14 May 2020 20:49:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#29). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 27 files changed, 2,703 insertions(+), 696 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/29 -- To view, visit http://gerrit.cloudera.org:8080/14824 To
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 28: Code-Review+2 (5 comments) http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58 PS27, Line 58: inline ClientRequestState* o > isn't the point here that the pointer returned by query_driver() should be Sure, I mean there are basically two cases for how this might be used: - Most of the uses of this in the current patch are calls like "query_handle.query_driver()->SomeQueryDriverFn()", in which case the QueryHandle is going to be in scope for the whole call and making this a "const&" saves an atomic inc/dec without any risk. - If a developer wants to make a local variable like "shared_ptr = query_handle.query_driver()", making this a "const&" gives them the option of whether or not to make a copy (i.e. whether the local variable itself is declared a const& or not). This introduces a little risk, cause a dev could not make a copy in a situation where they actually needed one. Its different than the case with adding the operator-> since there we were returning a bare pointer to a ClientRequestState which doesn't allow the caller to get any guarantees on its lifetime even if they want. Obviously this isn't really super perf critical code, so its not a big deal. Feel free to leave as is. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// id, which will be distinct from the query id of the originally submitted query that > I think in that case the error still > gets propagated to the the ClientRequestState::query_status_ > though, which is why the errors show up in the runtime profile. I think that's true in some cases but not in other cases, but its fine. http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc@35 PS28, Line 35: StartUnregister Finalize() http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc@150 PS28, Line 150: query_driver->CreateClientRequestState(query_ctx, session, _handle); You could also move the call to CreateClientRequestState into CreateNewDriver and then I think it wouldn't need to be public anymore and you wouldn't need to define the 'query_driver' variable on the line before this, and then pass the 'query_handle' into RegisterQuery which is nice for being consistent about passing QueryHandles everywhere. Fine to leave as is though http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177 PS27, Line 1177: _m > CloseClientRequestState requires a bunch of non-const access - e.g. it pass In that case, a non-const QueryHandle& or a QueryHandle*. Not a big deal though -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 28 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 13 May 2020 18:01:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 28: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6045/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 28 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 12 May 2020 23:53:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 27: (25 comments) > I know I've asked for a lot, but I think the patch is a lot better now. Not a problem, thanks for all the thorough feedback. Don't think this patch would be nearly as clean without all the comments. > I still have a lot of concerns about this feature, esp. about our story > around observability, but I think those concerns can mostly be addressed in > follow up work and most of my remaining comments are more minor things. Right. I don't think the expectation should be that this feature is complete by any means. I'm not sure how feature development has been done historically in Impala, but IMO I think features need to be built piece-meal over multiple-commits, before they can really be used. I won't rant too long about this, but I just think that getting an initial version that users can at least play around with gives us a level of feedback and testing that can't be done within a scoped design review session / single code review. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: > I don't think that's true - ClientRequestState::UpdateQueryStatus() is never > called from inside Coordinator. I guess, a '-->' wasn't the correct way to describe this. StartBackendExec() returns an error to Exec() which returns an error to FinishExecQueryOrDmlRequest() which then calls UpdateQueryStatus(). > Now, MarkAsRetrying() will set an error on ClientRequestState but after that > Coordinator::Exec() will still return an error which ClientRequestState will > try to set on itself with UpdateQueryStatus() but it'll get discarded because > an error was already set. Right. > The point is, its messy and I'm concerned it will be error prone. I agree that it is more complex. I'm not sure I understand why it is particularly more error prone. It's not that I don't want to take your suggestion, I just think at this point it is better to address this separately. My hesitation is that doing this re-factoring now is going to introduce bugs into the patch unnecessarily. From experience when first iterating on this patch, doing seemingly harmless re-factoring has introduced a lot of bugs. So I'm not really convinced it is worth the effort at this point in time. Taking a step back, I think at this point in the feature lifecyle, its not worth investing a huge amount of time on this. The feature itself it still immature, and subject to change. It's hard to predict exactly what it will look like after all the planned enhancement, so a lot of this debate might end up being a moot point. I am taking note of this feedback though, and plan to track it in IMPALA-9370. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910 PS27, Line 910: Status status = Status::OK(); > I don't think declaring this here is necessary since its not used outside o Done http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58 PS27, Line 58: std::shared_ptr > I think you might want to make this a '&' to avoid unnecessary copies isn't the point here that the pointer returned by query_driver() should be able to out-live the QueryHandle instance? in which case you would need to take a shared_ptr copy to ensure proper ref counting? at least, I thought that was part of the argument for creating the whole "->" and "*" operator overloading. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93 PS27, Line 93: TryRetryQuery > TryQueryRetry Done http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// > Right, but MarkAsRetrying doesn't get called in the case where you call > AddDetail() because TryQueryRetry returns immediately after AddDetail() Whoops, yeah you are right. I think in that case the error still gets propagated to the the ClientRequestState::query_status_ though, which is why the errors show up in the runtime profile. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228 PS27, Line 228: // Run the new query. > I think its fine to leave for now, but just wanted to note that the duplica
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#28). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 27 files changed, 2,692 insertions(+), 682 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/28 -- To view, visit http://gerrit.cloudera.org:8080/14824 To
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 27: (25 comments) Thanks for the changes. I know I've asked for a lot, but I think the patch is a lot better now. I still have a lot of concerns about this feature, esp. about our story around observability, but I think those concerns can mostly be addressed in follow up work and most of my remaining comments are more minor things. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: > Before this patch, if a query fails: > ClientRequestState::FinishExecQueryOrDmlRequest() > --> Coordinator::Exec() --> Coordinator::StartBackendExec() --> > ClientRequestState::UpdateQueryStatus() which sets > ClientRequestState::query_status_. I don't think that's true - ClientRequestState::UpdateQueryStatus() is never called from inside Coordinator. If an Exec() rpc fails, before this patch Coordinator::Exec() will return an error status and ClientRequestState will call UpdateQueryStatus() on itself to set the error that's returned from Exec(). Now, MarkAsRetrying() will set an error on ClientRequestState but after that Coordinator::Exec() will still return an error which ClientRequestState will try to set on itself with UpdateQueryStatus() but it'll get discarded because an error was already set. As far as I can tell, prior to this patch if the Coordinator needed to communicate an error to the ClientRequestState it always did so by returning an error Status, eg. if a backend reports an error usually that gets bubbled up when the ClientRequestState calls GetNext(), and now we've added this other patch where the ClientRequestState calls into the Coordinator and during that call the Coordinator calls back into the ClientRequestState to set the error even though its also about to return another Status that represents the same error with a slightly different text. The point is, its messy and I'm concerned it will be error prone. > A lot of the re-factoring we end up > doing now might just be wasted work if we end up changing half the > logic. Sure, you don't have to take my suggestions, esp. if you have a good reason/a vision of your own for how to clean this all up. I'm trying to help make sure this patch is well designed because I think it'll make the further work easier, eg. minimizing the amount of retry related logic that ends up in Coordinator should make it easier to make further changes to how retries work. If you're really determined to keep the calls to TryQueryRetry in Coordinator in this patch, its fine with me. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910 PS27, Line 910: Status status = Status::OK(); I don't think declaring this here is necessary since its not used outside of the if/else immediately below http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58 PS27, Line 58: std::shared_ptr I think you might want to make this a '&' to avoid unnecessary copies http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93 PS27, Line 93: TryRetryQuery TryQueryRetry http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// > TryQueryRetry passes the status to > MarkAsRetrying which sets the given status as the query_status_ of > the ClientRequestState, which is what gets exposed in the runtime > profile. Right, but MarkAsRetrying doesn't get called in the case where you call AddDetail() because TryQueryRetry returns immediately after AddDetail() http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228 PS27, Line 228: // Run the new query. I think its fine to leave for now, but just wanted to note that the duplication of the query execution logic between here and ImpalaServer is unfortunate and it might be nice to have more of a QueryDriver::RunQuery() type function that works for both initial attempts and retries when we're doing refactoring. http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h File be/src/service/client-request-state.h:
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 27: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6041/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 27 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 12 May 2020 17:52:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#27). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 27 files changed, 2,634 insertions(+), 675 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/27 -- To view, visit http://gerrit.cloudera.org:8080/14824 To
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 26: (8 comments) http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511 PS26, Line 511: if (!failed_backend_states.empty()) { Seems like it shouldn't be possible for this to ever be false. Maybe add a DCHECK? http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@512 PS26, Line 512: RETURN_IF_ERROR I don't think its possible for HandleFailedExecRpcs() to return an error, and even if it did it probably wouldn't be correct to just return like this here since we still need to call UpdateExecState() below to update the query to an error status. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@982 PS26, Line 982: parent_query_driver_->TryQueryRetry(parent_request_state_, _status); Probably fine to leave as is for now since currently any query that hits this is going to end up failing, but just wanted to point out that this code path means that we could potentially cancel the query without having actually received an error status back yet. As noted elsewhere, this makes it more difficult to use the AddDetail() on the Status out parameter in TryQueryRetry. http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079 PS26, Line 1079: parent_query_driver_->TryQueryRetry(parent_request_state_, _status); Wanted to point out that there's an unfortunate situation here and elsewhere TryQueryRetry is called where we end up with two slightly different versions for the overall query status - the one that gets set on Coordinator (in this case, exec_rpc_status, which represents the error from the first backend to fail in Exec()) and 'retryable_status' which gets set on ClientRequestState in MarkAsRetrying(). Normally, those two statuses would be the same, because ClientRequestState would get Coordinator's state returned to it from Exec(), which it would then set on itself, it won't set it anymore as we only preserve the first error message. I think that's messy and has the potential to be confusing for developers. Personally, I still really think it would be worth it to do the work to at least somewhat remove the circular dependency here, eg. you could move the call to TryQueryRetry to ClientRequestState::UpdateQueryStatus() along with defining a Coordinator::BlacklistedExecutors() or something like that, It wouldn't be much work, it would solve this problem, and it would make the control flow a lot cleaner, make Coordinator better encapsulated since it wouldn't need to know anything about retries, etc. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// query option 'retry_failed_queries') and if the query can be retried. Queries can > Doesn't return a Status anymore. Sure, 'status' is used in one code path that calls TryQueryRetry (the code path in Coordinator where we get an error in a status report, which test_multiple_retries is testing), but its not used in any of the other code paths that call TryQueryRetry so it gets lost in those cases (eg. HandleFailedExecRpcs) One option would be to just fix those code paths to actually set the status they passed in as the out parameter to the overall error status, but that's not always going to be easy (eg. the path in Coordinator where we get an AuxErrorInfo but no overall query error status in the report, so we have no error status to add the details to). Instead, maybe it would be better to just add an info string to the profile that says "This query was not retried because..." One disadvantage of that is that it means you can't immediately tell from the overall error why the query wasn't retried and have to dig down into the profile/logs, but that's already the case since you're only adding the detail if we hit the max retries and not for other cases such as if rows had already been fetched, though maybe you want to do an AddDetail for those cases too. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204 PS24, Line 204: /// 'RunFrontendPlanner'. > Do you mean that CreateRetriedClientRequestState could just pass in a point Personally, I think the code is way more confusing this way. You're not "moving" the TExecRequest from the original ClientRequestState to the new one - the TExecRequest is just owned by the QueryDriver, so you're moving it from one place in the QueryDriver to a different place in the QueryDriver. It doesn't change at any point,
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 26: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6012/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 26 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 08 May 2020 20:07:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 25: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6011/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 25 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 08 May 2020 20:04:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 25: (20 comments) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc@937 PS24, Line 937: parent_query_dri > This isn't valid. (and you have similar issues elsewhere) Done http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@1 PS24, Line 1: // Licensed to the Apache Software Foundation (ASF) under one > The distinction isn't that clear, but it might make more sense to put this I think client-request-state probably belongs in the runtime folder. I plan to move some of this around in IMPALA-9370. Left a comment to follow up on this in IMPALA-9370. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@39 PS24, Line 39: QueryHandle > I think this name is confusing, esp. since there's already a QueryHandle in I like referring to this as a handle, since its really just a pointer into other query level objects. I think some of these re-factoring / re-renmaing decisions should be done in IMPALA-9370 so that we can tackle them all together and ensure that all class names are consistent with one another. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@82 PS24, Line 82: /// *Transparent* Query Retries: > Not a huge deal in this patch since it's still hidden behind a flag and exp That is a good point. I added some notes here and in the uber-JIRA as a disclaimer. It's a bit hard to change all the places where this is referred to as "transparent query retries" and I think many folks are already using the phrase. So I think at this point it is a matter of setting proper expectations. I didn't explicitly mention the GetRuntimeProfile() changes so we are planning to revise the behavior in IMPALA-9229 Yeah - I've tested this with Hue, and it works as expected. There are a few observability improvements that could be made to Hue to improve this. The uber-JIRA (IMPALA-9124) has some sub-tasks related to impala-shell usability improvements. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: /// query option 'retry_failed_queries') and if the query can be retried. Queries can > The way you're both returning a Status and also returning some status infor Doesn't return a Status anymore. The input Status is used to set the error message "Max retry limit was hit" which does show up in the final error message (see TestQueryRetries::test_multiple_retries) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@189 PS24, Line 189: owns this QueryDriver. > ? Forgot to remove this. Deleted. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@194 PS24, Line 194: : /// The ClientRequestState > ? Forgot to remove this. Deleted. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204 PS24, Line 204: /// 'RunFrontendPlanner'. > Now that this is just move()-ed from 'exec_request_' I'm not sure why its s Do you mean that CreateRetriedClientRequestState could just pass in a pointer to exec_request_ rather than retry_exec_request_ when creating the new ClientRequestState? I guess we could do that, although I think its a little easier to understand the code in its current form. you move the TExecRequest out from the original ClientRequestState into the new one. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@97 PS24, Line 97: if (client_request_state->fetched_rows()) { > This function would be more readable if you turned these into: Done http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@302 PS24, Line 302: try_request > Instead of this make_unique/move() thing, why not just have SetOriginalId() Done http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884 PS24, Line 884: QueryHandle query_handle; > I'm not sure that in its current state QueryHandle is really doing much. Th I think the point of the QueryHandle is to enforce that whenever users use a ClientRequestState, that the shared_ptr to the corresponding QueryDriver is in scope. QueryHandle might not be the perfect way to do this, but I think its better than the previous solution where you had to make sure the shared_ptr
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#26). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,482 insertions(+), 513 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/26 -- To view, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 25: (4 comments) http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@642 PS25, Line 642: Status status = GetQueryHandle(query_id, _handle, /*return_unregistered=*/ true); line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@644 PS25, Line 644: ClientRequestState* request_state = query_handle.request_state; line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@710 PS25, Line 710: Status status = GetQueryHandle(query_id, _handle, /*return_unregistered=*/ true); line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@712 PS25, Line 712: ClientRequestState* request_state = query_handle.request_state; line has trailing whitespace -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 25 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 08 May 2020 19:14:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#25). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,480 insertions(+), 513 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/25 -- To view, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 24: (16 comments) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc@937 PS24, Line 937: RETURN_IF_ERROR( This isn't valid. (and you have similar issues elsewhere) As far as I can tell, the only way that this returns is if we decide to retry the query and then Thread::Create returns an error, but even in that case we need to still finish this function and eg. call UpdateExecState below with the original error status. In general, errors related to failing to retry aren't ever going to be overall query errors, and we'll always need to finish whatever processing we were doing for the original query, so probably all we need to do is log any errors from TryQueryRetry and move on. It may even be best to log the errors from within TryQueryRetry and not have it even return anything. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@1 PS24, Line 1: // Licensed to the Apache Software Foundation (ASF) under one The distinction isn't that clear, but it might make more sense to put this in /be/src/service, since right now its: impala-server (src/service) -> query-driver (src/runtime) -> client-request-state (src/service) -> coordinator (src/runtime) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@39 PS24, Line 39: QueryHandle I think this name is confusing, esp. since there's already a QueryHandle in beeswax. Its really more of a ClientRequestStateHandle, though that's wordy. Maybe CRSHandle? ClientRequestHandle? It might even be worth renaming ClientRequestState, since if anything the QueryDriver is what really holds the current state of the client request. Maybe QueryInstance? Not a huge deal though http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@82 PS24, Line 82: /// *Transparent* Query Retries: Not a huge deal in this patch since it's still hidden behind a flag and experimental, but I think that we should be cautious about saying that this is completely transparent, since it does change what clients see, potentially in ways that could break existing clients (eg. if you call GetRuntimeProfile() and then do an assert that the returned profile has the same query_id as what you requested, turning this on could cause that assert to fail). Obviously at a minimum we need to test this with all the usual known clients (impala-shell, impyla, jdbc/odbc, Hue, etc.), and we'll probably be making changes to at least some of those clients for the sake of observability around this (eg. impala-shell might want to print a message about the query getting retried) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136 PS24, Line 136: Status TryQueryRetry(ClientRequestState* client_request_state, Status* status, The way you're both returning a Status and also returning some status information in a Status out parameter seems confusing, and in fact it looks like at most of the call sites of this function the Status out parameter is never actually referenced again after the call so any info added to it is usually dropped. I think its probably better to just not have 'status' be an out parameter and return use the returned status for everything, or possibly not return any status info at all (see my other comments) http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@189 PS24, Line 189: A shared_ptr is used to allow asynchronous deletion ? http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@194 PS24, Line 194: A shared_ptr is used to allow : /// asynchronous deletion. ? http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204 PS24, Line 204: std::unique_ptr retry_exec_request_; Now that this is just move()-ed from 'exec_request_' I'm not sure why its still needed. http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@97 PS24, Line 97: if (!client_request_state->fetched_rows()) { This function would be more readable if you turned these into: if (reason_not_to_retry) { LOG << reason; return; } if (another_reason) { ... (after doing that, I personally would get rid of RetryAsync() as 1) its only called in one place 2) a lot of its code is just DCHECK-ing things that the ifs here enforce, so that could just be removed and 3) the names
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 24: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5990/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 24 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 07 May 2020 15:50:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#24). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,458 insertions(+), 510 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/24 -- To view, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 23: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/5983/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 23 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 07 May 2020 01:06:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 23: Fixed compilation issue. Rebased against and resolved some conflicts on top of IMPALA-9692. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 23 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 07 May 2020 00:54:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#22). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,458 insertions(+), 510 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/22 -- To view, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 21: Code-Review+1 (2 comments) This is looking good to me. http://gerrit.cloudera.org:8080/#/c/14824/21/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/21/be/src/runtime/coordinator.cc@1071 PS21, Line 1071: transform(failed_backend_states.begin(), failed_backend_states.end(), : back_inserter(backend_addresses), [](BackendState* backend_state) -> string { : return TNetworkAddressToString(backend_state->krpc_impalad_address()); : }); Purely style nit: I think this would be cleaner as a for-loop. for (BackendState* backend_state : failed_backend_states) { backend_addresses.push_back(TNetworkAddressToString(...)); } http://gerrit.cloudera.org:8080/#/c/14824/21/be/src/runtime/coordinator.cc@1077 PS21, Line 1077: for_each(failed_backend_states.begin(), failed_backend_states.end(), : [_status](BackendState* backend_state) { : retryable_status.MergeStatus( : FromKuduStatus(backend_state->exec_rpc_status(), "Exec() rpc failed")); : }); Purely style nit: I think this would be cleaner as a for-loop: for (BackendState* backend_state : failed_backend_states) { retryable_status.MergeStatus(FromKuduStatus(...), ...); } -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 21 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 23:14:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5973/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 21 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 03:56:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 19: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5972/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 19 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 03:53:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 19: Rebased on top of IMPALA-9380, which had a bunch of merge conflicts that I had to resolve. Re-ran the tests from IMPALA-9380 and they pass. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 19 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 03:13:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#21). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,445 insertions(+), 510 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/21 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 19: (2 comments) http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1153 PS19, Line 1153: > line has trailing whitespace Done http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1310 PS19, Line 1310: RETURN_IF_ERROR(query_handle.request_state->Cancel(/*check_inflight=*/ true, /*cause=*/ nullptr)); > line too long (100 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 19 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 03:12:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 19: (2 comments) http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1153 PS19, Line 1153: line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1310 PS19, Line 1310: RETURN_IF_ERROR(query_handle.request_state->Cancel(/*check_inflight=*/ true, /*cause=*/ nullptr)); line too long (100 > 90) -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 19 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 03:10:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#19). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/benchmarks/process-wide-locks-benchmark.cc M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,444 insertions(+), 510 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/19 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 18: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5971/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 18 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 02:10:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 17: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5970/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 17 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 02:09:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#18). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,372 insertions(+), 515 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/18 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 17: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h@249 PS17, Line 249: virtual void Cancel(impala::TStatus& status, const beeswax::QueryHandle& beeswax_handle); > line too long (91 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 17 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 01:25:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 17: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h@249 PS17, Line 249: virtual void Cancel(impala::TStatus& status, const beeswax::QueryHandle& beeswax_handle); line too long (91 > 90) -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 17 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 01:24:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#17). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/testutil/impalad-query-executor.cc M be/src/testutil/impalad-query-executor.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 28 files changed, 2,370 insertions(+), 514 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/17 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 16: (4 comments) http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h@248 PS16, Line 248: A shared_ptr is used because the : /// Coordinator can outlive its associated QueryDriver. > This comment needs updating Done http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h@607 PS16, Line 607: A shared_ptr is used because the : /// ClientRequestState can outlive its associated QueryDriver. > Update this comment (or remove) Done http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc@212 PS16, Line 212: query_handle; > Nit: It would be nice to standardize the naming for variables/arguments for Done http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h@933 PS16, Line 933: void TUniqueIdToQueryHandle(const TUniqueId& query_id, beeswax::QueryHandle* handle); : void QueryHandleToTUniqueId(const beeswax::QueryHandle& handle, TUniqueId* query_id); > Nit: Should we rename this to emphasize that this is a beeswax::QueryHandle Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 16 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Wed, 06 May 2020 01:22:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 16: (4 comments) Took another pass. It is mostly nits about beeswax::QueryHandle vs impala::QueryHandle and some stale comments. http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h@248 PS16, Line 248: A shared_ptr is used because the : /// Coordinator can outlive its associated QueryDriver. This comment needs updating http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h@607 PS16, Line 607: A shared_ptr is used because the : /// ClientRequestState can outlive its associated QueryDriver. Update this comment (or remove) http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc@212 PS16, Line 212: query_handle; Nit: It would be nice to standardize the naming for variables/arguments for beeswax::QueryHandle vs impala::QueryHandle across the functions in this file. e.g. the beeswax::QueryHandle is always 'handle' or 'beeswax_handle' or whatever and the impala::QueryHandle is always 'query_handle' or 'impala_query_handle' or whatever. I think I would lean toward beeswax::QueryHandle -> 'beeswax_handle' and impala::QueryHandle -> 'query_handle', but I'm flexible. This variability predates your change, but I don't think there is anything stopping us from standardizing. http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h@933 PS16, Line 933: void TUniqueIdToQueryHandle(const TUniqueId& query_id, beeswax::QueryHandle* handle); : void QueryHandleToTUniqueId(const beeswax::QueryHandle& handle, TUniqueId* query_id); Nit: Should we rename this to emphasize that this is a beeswax::QueryHandle? Something like: TUniqueIdToBeeswaxHandle() BeeswaxHandleToTUniqueId() -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 16 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 05 May 2020 23:38:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 16: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5941/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 16 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 04 May 2020 01:52:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169 PS13, Line 169: _state_; > I should have mentioned that the shared_ptr approach also requires special i decided just to create a struct called QueryHandle that contains a shared_ptr to the QueryDriver and a naked pointer to the ClientRequestState. I reviewed all the places where a QueryDriver / ClientRequestState are used and make sure the ClientRequestState is only used when the QueryDriver is still in scope. I used new QueryDriver struct where I thought it was useful (e.g. ImpalaServer::GetQueryHandle). i added some docs that explain the ClientRequestState should only be used when a valid QueryDriver is in scope. since the QueryDriver must be in scope while the ClientRequestState is being used, I changed the ownership of the ClientRequestState back to unique_ptr and changed the Get*ClientRequestState() methods to return a naked pointer. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763 PS9, Line 763: ClientRequestState* request_state = nullptr; > yeah, thats a good point, thanks for catching that. changed it to use a sha i ended it up changing it back to unique_ptr. I changed the semantics a bit so that whenever you use a ClientRequestState, the corresponding QueryDriver shared_ptr must have an active reference on the stack. I added a new struct called QueryHandle that contains a shared_ptr to a QueryDriver and a naked pointer to a ClientRequestState. the idea is that the QueryHandle will help insure a QueryDriver reference is alive when accessing a ClientRequestState. there are a few places where it didn't make sense to use the new QueryHandle, so I just spot checked that whenever a ClientRequestState is used, the corresponding QueryDriver is still in scope. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 9 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 04 May 2020 01:06:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#16). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,328 insertions(+), 473 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/16 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169 PS13, Line 169: > Fixed. The Coordinator and ClientRequestState now take a shared_ptr to the I should have mentioned that the shared_ptr approach also requires special teardown. We need to break the cycle of shared_ptrs, so there isn't a resource leak. If A holds a shared_ptr to B and B holds a shared_ptr to A, they will both continue to exist even if the last external shared_ptr is destructed. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 13 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 22:52:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5928/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 15 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 22:17:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5927/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 14 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 22:16:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc@163 PS14, Line 163: ::RetryQueryFromThread, this, error, query_driver, _query_thread_)); > line too long (92 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 14 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 21:34:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#15). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,301 insertions(+), 427 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/15 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 15 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer:
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 13: (5 comments) addressed comments, fixed a possible race condition when a query gets cancelled and unregistered while a retry is being setup. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169 PS13, Line 169: > ClientRequestState has a pointer back to the QueryDriver (as does the Coord Fixed. The Coordinator and ClientRequestState now take a shared_ptr to the QueryDriver http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@161 PS13, Line 161: void QueryDriver::RetryQueryFromThread(const Status& error) { > Is there anything that prevents the QueryDriver from being deleted while th hmm yeah that's true. I changed it so that RetryQueryFromThread takes in a shared_ptr to the current QueryDriver. its a bit odd that the method has to take in a reference to itself, but I couldn't see a better way. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@178 PS13, Line 178: // Cancel the query. > Is this necessary? At least in the case of Coordinator calling TryQueryRetr I don't think 'ImpalaServer::CancelFromThreadPool' (which calls TryQueryRetry() will call Cancel. Its probably possible to remove this, but I thought it was a nice property that you want to ensure the query is cancelled before trying to retry it. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@179 PS13, Line 179: true, nullptr > Both of these parameters seem wrong - you note above that a query could get I changed the 'check_inflight' parameter to false. ClientRequestState::MarkAsRetrying explicitly sets the query_status_ in the ClientRequestState. The issue with calling Cancel with a cause is that it will call UpdateQueryStatus, which sets the ExecState to ERROR, which you don't want because the ERROR state will get exposed to the client, which you don't want to happen since the query is being retried. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@281 PS13, Line 281: // TODO: IMPALA-9502: Avoid copying TExecRequest when retrying queries > Out of curiosity, what are the issues here? hmm can't remember, I replaced this with 'move(exec_request_)' and all the tests and a run of the stress test passes. I added this a long time ago on a much older version of this feature, so it is possible it got fixed as some point inadvertently. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 13 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 21:33:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc@163 PS14, Line 163: ::RetryQueryFromThread, this, error, query_driver, _query_thread_)); line too long (92 > 90) -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 14 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 21:32:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#14). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,300 insertions(+), 427 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/14 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 14 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer:
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169 PS13, Line 169: ClientRequestState has a pointer back to the QueryDriver (as does the Coordinator), so if ClientRequestState outlives the QueryDriver, then that pointer would be dangling. I think the simplest lifetime semantic is for the QueryDriver to live as long as any child ClientRequestStates is alive. I think we are pretty close to that being true. Most code looks up a QueryDriver and then looks up the ClientRequestState with the QueryDriver shared_ptr staying on the stack. One option is to formalize it and have a struct that contains a shared_ptr to the QueryDriver and then a pointer to ClientRequestState. If everything uses that struct when operating on a ClientRequestState, then the QueryDriver will always outlive its children ClientRequestStates. This is just a sketch, so I might be missing something. An implicit way to do this is to have a shared_ptr from ClientRequestState to the QueryDriver. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 13 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 01 May 2020 17:48:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 13: (4 comments) Still going through it http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@161 PS13, Line 161: void QueryDriver::RetryQueryFromThread(const Status& error) { Is there anything that prevents the QueryDriver from being deleted while this thread is running? eg. what if the query fails and then while the retry is happening the user cancels the query? Mostly we're relying on the shared_ptr to prevent those sorts of issues, but I don't think we're holding a shared_ptr for the duration of this, since you have to just pass a bare 'this' to Thread::Create http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@178 PS13, Line 178: // Cancel the query. Is this necessary? At least in the case of Coordinator calling TryQueryRetry() we'll end up calling cancel from the Coordinator too when the status is updated to an error. I guess its fine because Cancel() is idempotent, but might be worth thinking about what we can do to make this cleaner. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@179 PS13, Line 179: true, nullptr Both of these parameters seem wrong - you note above that a query could get retried when its still in the INITIALIZED state, so seems like 'check_inflight=true' could fail, and according to the comment about ClientRequestState::Cancel 'nullptr' for cause is supposed to indicate a user-initiated cancel that doesn't have an associated error, which this isn't. I think in the case of an error that comes from the query itself (i.e. Coordinator call TryQueryRetry()) you'll get the error set anyways when UpdateExecState() is called (unless the AuxErrorInfo comes in a different rpc than the overall error message, in which case I think there's a race between the coordinator getting cancelled and no longer accepting reports, though I'm not sure), and in the case of a backend crashing (i.e. CancelFromThreadPool calls TryQueryRetry()), I don't think the error will get set. http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@281 PS13, Line 281: // TODO: IMPALA-9502: Avoid copying TExecRequest when retrying queries Out of curiosity, what are the issues here? -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 13 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 23:04:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#13). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * IMPALA-9225: A query cannot be retried once any results from the original query have been fetched * IMPALA-9200: A query can currently only be retried once Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,267 insertions(+), 427 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/13 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 13: (2 comments) addressed remaining comments http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@502 PS9, Line 502: && !backend_state->exec_rpc_status().IsAborted()) > For this loop, do we have any expectation about the number of BackendStates yeah, good point. fixed. http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py@93 PS9, Line 93: Increase the statestore : heartbeat frequency so that the query actually fails during execution. > It looks like we are using the default value for statestore_heartbeat_frequ yeah - the tests infra actually sets a lower value for this: https://github.com/apache/impala/blob/master/tests/common/custom_cluster_test_suite.py#L59 yes, your understanding is correct. a RPC should fail which should cause the query to be retried. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 13 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 12:53:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5916/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 12 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 00:50:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5915/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 11 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 00:48:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5914/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 10 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 00:47:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc@70 PS11, Line 70: shared_ptr QueryDriver::GetClientRequestState(const TUniqueId& query_id) { > line too long (94 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 11 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 00:06:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#12). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * IMPALA-9225: A query cannot be retried once any results from the original query have been fetched * IMPALA-9200: A query can currently only be retried once Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,259 insertions(+), 433 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/12 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 11: (10 comments) Patch set 10 is a rebase. Patch set 11 address the following comments. Addressed / responded to the rest of the comments from Thomas. Addressed most (but not all) comments from Joe. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@113 PS9, Line 113: (2) the : /// query has not already been ret > not sure what this is supposed to mean revised http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@116 PS9, Line 116: d if : /// there has been a c > I think this should be singular "a statestore update" Done http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@122 PS9, Line 122: /// is set to true if the query was actually retried, false otherwise. This method is : /// idempotent, it can safely be called multiple tim > From reading through TryQueryRetry(), it looks like this is safe to call mu updated docs to mention that method is idempotent http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h@84 PS9, Line 84: ; > I don't think this is used anywhere? Done http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@148 PS9, Line 148: shared_ptr query_driver; > I don't think this is used anywhere? Done http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763 PS9, Line 763: HS2_RETURN_IF_ERROR( > I had the same concern. yeah, thats a good point, thanks for catching that. changed it to use a shared_ptr http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@1039 PS9, Line 1039: // The original query profile should still be accessible via the web ui. Don't use > I think its definitely necessary that we provide a way for all relevant pro agree this is an issue. the reason i went with the current approach is so that all methods in the hs2/beeswax interface follow the same "transparent" pattern. i think it might be slightly more intuitive to return the profile of the successful query rather than the originally failed one. although I think returning the original, failed profile could make sense as well. I don't know if one approach is necessarily that much better than the other. i agree that there should be a way to fetch the failed profiles via the hs2 interface, I'm planning to tackle that in IMPALA-9229 http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h@673 PS9, Line 673: std::shared_ptr Stylistically, I personally don't like mutable references for out arguments Done since I had to change this all the shared_ptr, so now it is "shared_ptr*" http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc@1453 PS9, Line 1453: // ClientRequestState::output_exprs_, which > Nit: you could move this declaration down past the first BlockOnWait() Done http://gerrit.cloudera.org:8080/#/c/14824/9/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/14824/9/common/thrift/ImpalaService.thrift@523 PS9, Line 523: > typo Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 11 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 00:04:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc@70 PS11, Line 70: shared_ptr QueryDriver::GetClientRequestState(const TUniqueId& query_id) { line too long (94 > 90) -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 11 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 30 Apr 2020 00:04:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#11). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * IMPALA-9225: A query cannot be retried once any results from the original query have been fetched * IMPALA-9200: A query can currently only be retried once Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,258 insertions(+), 433 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/11 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#10). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * IMPALA-9225: A query cannot be retried once any results from the original query have been fetched * IMPALA-9200: A query can currently only be retried once Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,253 insertions(+), 448 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/10 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 9: (7 comments) http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@502 PS9, Line 502: RETURN_IF_ERROR(HandleFailedExecRpc(backend_state)); For this loop, do we have any expectation about the number of BackendStates that have errors? In other words, can more than one have errors? If there can be more than one error, we would call TryQueryRetry more than once, but based on my reading of that function, it can handle that. Am I right? Is there any concern that if there is more than one error, we may want to blacklist more than one Impalad before retrying? http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@116 PS9, Line 116: a : /// statestore updates I think this should be singular "a statestore update" http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@122 PS9, Line 122: Status TryQueryRetry(ClientRequestState* client_request_state, Status* status, : bool* was_retried = nullptr) WARN_UNUSED_RESULT; >From reading through TryQueryRetry(), it looks like this is safe to call >multiple times and it will only do the retry once. Is that true? Can we add a >comment here that spells out that it can be called more than once? http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763 PS9, Line 763: ClientRequestState* request_state = nullptr; > So I'm wondering if its safe to no longer have any shared_ptr here and in t I had the same concern. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h@673 PS9, Line 673: ClientRequestState*& Stylistically, I personally don't like mutable references for out arguments. Using ** instead makes it much clearer that values are being passed between functions. Callers need to pass addresses. The function itself can't treat it like a local variable. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc@1453 PS9, Line 1453: ClientRequestState::RetryState retry_state; Nit: you could move this declaration down past the first BlockOnWait() http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py@93 PS9, Line 93: Increase the statestore : heartbeat frequency so that the query actually fails during execution. It looks like we are using the default value for statestore_heartbeat_frequency_ms. Is that intentional? If I understand this correctly, this test uses the heavy shuffle query and should see an RPC failure. We are not expecting the statestore to see the Impalad as dead. Is that right? -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 9 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 27 Apr 2020 17:48:11 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 9: (2 comments) responded to a few comments from Thomas, still thinking through some of his other comments. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@922 PS9, Line 922: parent_query_driver_->TryQueryRetry(parent_request_state_, _status)); yeah, agree it is unfortunate. although i think the issue has always been there. the coordinator class already has a parent_request_state_ that points to the owning ClientRequestState. the ClientRequestState has a similar pattern, it has a parent_server_ which points to the owning ImpalaServer. i'm not sure this is overall a major problem, it does make the code a bit confusing, but I've done my best to limit the usage of the parent_query_driver_ class. only the TryQueryRetry method is ever called on the parent_query_driver_. yeah, it does take the coordinator lock, but only for a pretty short period of time - it just checks some state. the actual work to schedule and run the retried query is done in a separate thread, that doesn't require holding the coordinator lock. acquiring the lock should only be done rarely - only when a query fails with a retryable error. > Would it be possible to instead have the QueryDriver wait on the coordinator > to finish and then check its status and decide whether to retry then? > One problem is the QueryDriver needs to know not just if the query hit an > error but if the error was something retryable, but we could do something > like have the coordinator remember any nodes it blacklists and expose that > info to the QueryDriver. yeah, I considered this at some point, and it might be a cleaner design, just not sure how much re-factoring it is going to require. i can convince myself the current approach of just calling TryRetryQuery in the coordinator is fine as well. I think they can be thought of as "callback" functions into the QueryDriver that get triggered under specific conditions. if you still feel strongly about it, i can do some digging, but would prefer to make the code changes in a follow up patch because i don't think it will be a straightforward / small change to make, and i don't want to expand the scope of this patch further. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@1060 PS9, Line 1060: ExecEnv::GetInstance()->cluster_membership_mgr()->BlacklistExecutor( > This of course doesn't actually guarantee that the retried query won't be s that's a good point, filed IMPALA-9636 to fix this. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 9 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 09 Apr 2020 17:57:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 9: (8 comments) Still going through this, but some high level thoughts (and a few nit-picks I happened to notice already) http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@922 PS9, Line 922: parent_query_driver_->TryQueryRetry(parent_request_state_, _status)); So obviously the circular dependency here between QueryDriver, ClientRequestState, and Coordinator is unfortunate. It probably works as is, and it might be difficult to get rid of without significant code restructuring, but I'm concerned its confusing/brittle (eg. this call results in us taking ClientRequestState::lock_ on the ReportExecStatus rpc thread) and so I've been thinking through the options. Would it be possible to instead have the QueryDriver wait on the coordinator to finish and then check its status and decide whether to retry then? One problem is the QueryDriver needs to know not just if the query hit an error but if the error was something retryable, but we could do something like have the coordinator remember any nodes it blacklists and expose that info to the QueryDriver. Another issue is that it means we won't start the retry quite as quickly, since we have to wait for the QueryDriver to notice the error, but that might be fine - its already the case that Coordinator::Wait() will return immediately after an error is reported, without waiting for other backends to be cancelled, see HandleExecStateTransition()->CancelBackends()->backend_exec_complete_barrier_->NotifyRemaining(). The worse case is when the blacklisting info and the error status don't arrive in the same report, eg. the call to TryQueryRetry below, but that should be rare since the error status will have been generated at the same time as the AuxErrorInfo, so you just have to get pretty unlucky with the timing of when the report is generated. Maybe the bigger issue is I'm not sure its easy to do with the way the code is set up. I was trying to trace through the various Wait()/WaitAsync()/BlockOnWait() whatever calls but I'm not sure how that all works now. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@1060 PS9, Line 1060: ExecEnv::GetInstance()->cluster_membership_mgr()->BlacklistExecutor( This of course doesn't actually guarantee that the retried query won't be scheduled on the executor that gets blacklisted, eg. if it takes longer for the query to get through admission control again than the blacklist timeout. We might want to consider doing something like passing through a list of executors to reschedule on, to avoid having queries repeatedly fail in the same way. On the other hand, the blacklist timeout was designed to be longer than it should take the statestore to notice the executor is down, so in theory if the executor really is down maybe it doesn't matter. Probably not necessary to do anything different for this patch, though, just wanted to point this out. http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h File be/src/runtime/query-driver.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@113 PS9, Line 113: (2) the : /// query has already been retried not sure what this is supposed to mean http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h@84 PS9, Line 84: UNKNOWN I don't think this is used anywhere? http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@148 PS9, Line 148: unique_ptr exec_request = make_unique(); I don't think this is used anywhere? http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763 PS9, Line 763: ClientRequestState* request_state = nullptr; So I'm wondering if its safe to no longer have any shared_ptr here and in the other places in this file - what's to stop the QueryDriver from getting deleted by Unregister while this function is executing, which would make this pointer no longer valid? http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@1039 PS9, Line 1039: // If the query was retried, fetch the profile for the most recent attempt of the query I think its definitely necessary that we provide a way for all relevant profiles to be accessed through HS2, not just the webserver. This is one case where the
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 9: After a lot of cleanup, doc updates, and testing, I think this patch is finally ready to go (open to general comments). I made a big change to the commit message to describe the patch + all the testing I have done. There are a few TODOs sprinkled around the code, but non are essential (most are cleanup related). -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 9 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 31 Mar 2020 18:45:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5664/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 9 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 31 Mar 2020 18:41:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5663/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 8 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 31 Mar 2020 18:39:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#9). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * IMPALA-9225: A query cannot be retried once any results from the original query have been fetched * IMPALA-9200: A query can currently only be retried once Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,253 insertions(+), 448 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/9 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 8: (2 comments) http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py File tests/stress/query_retries_stress_runner.py: http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@160 PS8, Line 160: e > flake8: E722 do not use bare except' Done http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@164 PS8, Line 164: e > flake8: E722 do not use bare except' Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 8 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 31 Mar 2020 18:08:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 8: (2 comments) http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py File tests/stress/query_retries_stress_runner.py: http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@160 PS8, Line 160: e flake8: E722 do not use bare except' http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@164 PS8, Line 164: e flake8: E722 do not use bare except' -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 8 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 31 Mar 2020 18:05:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#8). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) if a query fails and as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modification to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. * Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue Limitations: * IMPALA-9225: A query cannot be retried once any results from the original query have been fetched * IMPALA-9200: A query can currently only be retried once Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h A be/src/runtime/query-driver.cc A be/src/runtime/query-driver.h M be/src/service/CMakeLists.txt D be/src/service/client-request-state-map.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/control-service.cc M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h R be/src/service/query-driver-map.cc A be/src/service/query-driver-map.h M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py M tests/common/impala_cluster.py M tests/common/impala_service.py A tests/custom_cluster/test_query_retries.py A tests/stress/query_retries_stress_runner.py 26 files changed, 2,253 insertions(+), 448 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/8 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 7: (1 comment) > Instead of creating a separate doc, think it might be easier to > just post comments here. > > I've done some re-factoring locally and this is what I have so far: > > * Created a "QueryDriver", which (as suggested by Thomas) is a > layer of abstraction of on top of ClientRequestState > * "QueryDriver" owns the ClientRequestState; all access to the > ClientRequestState goes through the QueryDriver > * Since the first patch will only support a single retry, the > QueryDriver has a regular client_request_state_ and a > retried_client_request_state_ (one for the original query and the > other for the retried query) > * QueryDriver owns the TExecRequest that is used by both the > original and retried queries > * The "transparent" part of this feature is now handled by the > QueryDriver (since the QueryDriver owns all access to the > ClientRequestState) > * The QueryDriver has two methods: GetActiveClientRequestState() > and GetClientRequestState(query_id) > * GetActiveClientRequestState() returns the CRS for the "active" > query - e.g. if the query has not been retried yet, it returns the > original query; if the query has been retried, it returns the > retried query > * GetClientRequestState(query_id) returns the CRS for the query > that matches the given 'query_id' - this allows clients to get the > CRS of exactly the query_id they are looking for > * ImpalaServer creates a QueryDriver for each query; the > ImpalaServer::client_request_state_map_ has been replaced with the > ImpalaServer::query_driver_map_ which maps query ids to > QueryDrivers > * The query_driver_map_ can have key, values pairs with the same > value (e.g. the same QueryDriver) > * Moved all of the retry code (along with the retry threadpool) > into QueryDriver (specifically QueryDriver::RetryQueryFromThreadPool) > > Example: > > * Client submits a query for execution > * Impala creates a QueryDriver for the query > * QueryDriver::CreateClientRequestState creates a ClientRequestState > for the query (this CRS is owned by the QueryDriver) > * ImpalaServer::RegisterQuery "registers" the query and adds an > entry to ImpalaServer::query_driver_map_ that maps the query id to > the QueryDriver > * QueryDriver::RunFrontendPlanner creates the TExecRequest (owned > by the QueryDriver) > * Eventually, the query fails and is retried > * QueryDriver::RetryQueryFromThreadPool contains all the retry > logic (moved from ImpalaServer) > * It cancels the original query, creates the new one and runs it > * It calls ImpalaServer::RegisterQuery again, but with the retried > query id > * It manipulates some internal pointers so that retried_client_request_state_ > now points to the CRS of the retried query That all sounds good to me http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025 PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()()); > I'll try thinking through this some more. One challenge with creating an in Sure, I think this can probably be deferred for now and we can go with your original approach. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 7 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 24 Feb 2020 21:25:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025 PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()()); > I sort of did this in the previous patch update. I added a client_query_id_ I'll try thinking through this some more. One challenge with creating an internal vs. user facing id is that a common debugging practice is to take user facing id and use it to grep through the logs. A lot of the log statements contain the query id they are associated with. So the user would have to manually figure out the mapping from the user facing id (perhaps it got the id from the runtime profile or the client logs) and the internal id. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 5 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 13 Feb 2020 19:06:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 7: Instead of creating a separate doc, think it might be easier to just post comments here. I've done some re-factoring locally and this is what I have so far: * Created a "QueryDriver", which (as suggested by Thomas) is a layer of abstraction of on top of ClientRequestState * "QueryDriver" owns the ClientRequestState; all access to the ClientRequestState goes through the QueryDriver * Since the first patch will only support a single retry, the QueryDriver has a regular client_request_state_ and a retried_client_request_state_ (one for the original query and the other for the retried query) * QueryDriver owns the TExecRequest that is used by both the original and retried queries * The "transparent" part of this feature is now handled by the QueryDriver (since the QueryDriver owns all access to the ClientRequestState) * The QueryDriver has two methods: GetActiveClientRequestState() and GetClientRequestState(query_id) * GetActiveClientRequestState() returns the CRS for the "active" query - e.g. if the query has not been retried yet, it returns the original query; if the query has been retried, it returns the retried query * GetClientRequestState(query_id) returns the CRS for the query that matches the given 'query_id' - this allows clients to get the CRS of exactly the query_id they are looking for * ImpalaServer creates a QueryDriver for each query; the ImpalaServer::client_request_state_map_ has been replaced with the ImpalaServer::query_driver_map_ which maps query ids to QueryDrivers * The query_driver_map_ can have key, values pairs with the same value (e.g. the same QueryDriver) * Moved all of the retry code (along with the retry threadpool) into QueryDriver (specifically QueryDriver::RetryQueryFromThreadPool) Example: * Client submits a query for execution * Impala creates a QueryDriver for the query * QueryDriver::CreateClientRequestState creates a ClientRequestState for the query (this CRS is owned by the QueryDriver) * ImpalaServer::RegisterQuery "registers" the query and adds an entry to ImpalaServer::query_driver_map_ that maps the query id to the QueryDriver * QueryDriver::RunFrontendPlanner creates the TExecRequest (owned by the QueryDriver) * Eventually, the query fails and is retried * QueryDriver::RetryQueryFromThreadPool contains all the retry logic (moved from ImpalaServer) * It cancels the original query, creates the new one and runs it * It calls ImpalaServer::RegisterQuery again, but with the retried query id * It manipulates some internal pointers so that retried_client_request_state_ now points to the CRS of the retried query -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 7 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Thu, 13 Feb 2020 19:05:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5186/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 7 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 10 Feb 2020 17:36:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 6: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/5184/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 6 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 10 Feb 2020 17:34:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc@1463 PS6, Line 1463: void ImpalaServer::BlockOnWait(shared_ptr* request_state, bool* timed_out, > line too long (94 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 6 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 10 Feb 2020 16:51:50 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#7). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes. Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the leaving node, queries are retried), and (2) if a query fails and as a result, blacklists a node. Both events are considered cluster membership changes as they affect what nodes a query will be scheduled on. Implementation: * Query retries are driven by a dedicated threadpool * ImpalaServer::RetryQueryFromThreadPool implements the core logic to actually retry a failed query. * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A query cannot be retried once any results from the original query have been fetched, this is to prevent users from seeing incorrect results Features: * Retries are transparent to the user * This is achieved by adding a mapping from failed query ids to the query id of the retried query * ImpalaServer uses this mapping in GetClientFacingRequestState which is used to differentiate between "client facing" requests for a ClientRequestState vs. internal requrets for a CRS * Users can tell if a query is retried using runtime profiles and the Impala Web UI * "Impala Query Status" is a new field in runtime profiles that displays the ClientRequestState execution state (which includes the RETRYING and RETRIED states) * The Impala Web UI will list all retried queries as being in the "RETRIED" state * Retried queries skip all fe/ planning, authorization, etc. * This feature is configurable ('retry_failed_queries') and is off by default Refactoring: * Changes the ClientRequestState so that it can take in an existing TExecRequest * This is required when retrying queries because the TExecRequest of the failed query is copied and used for the ClientRequestState of the retried query * ClientRequestState::ExecState is extended with three new states: RETRYING, RETRIED, and UNKNOWN. Testing: * Added integration tests in test_query_retries.py, these tests consistently pass when run locally * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures TODO: * Additional re-factoring / code cleanup * Lots more documentation Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h M be/src/service/query-options.cc M be/src/service/query-options.h A be/src/service/retry-work.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py A tests/custom_cluster/test_query_retries.py 16 files changed, 793 insertions(+), 168 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/7 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 7 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc@1463 PS6, Line 1463: void ImpalaServer::BlockOnWait(shared_ptr* request_state, bool* timed_out, line too long (94 > 90) -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 6 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 10 Feb 2020 16:51:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 5: (7 comments) Addressed some of the comments and filed a few follow up JIRAs. Going to start a doc to discuss the "transparent" part of this feature (e.g. the internal query ids). Working on doing some of the re-factoring you mentioned (e.g. adding another layer of abstraction on top of ClientRequestState). http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@26 PS5, Line 26: A query cannot be retried once any results from the original query : have been fetched, this is to prevent users from seeing incorrect results > Do we have any mechanism for a user to specify that they want their results Not yet, but planning to tackle that in a follow up JIRA: IMPALA-9225 http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@41 PS5, Line 41: The Impala Web UI will list all retried queries as being in the : "RETRIED" state > This may be a reasonable approach, but I think we should think about the us Yeah, these are legitimate issues. We have few follow up JIRAs to tackle some of these problems: IMPALA-9229, IMPALA-9230, IMPALA-9213 and I just filed IMPALA-9364 as well - open to more suggestions about how to improve supportability. For this patch, I think it should be sufficient to add the original query id to the retried query profile, and vice versa. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h File be/src/service/client-request-state-map.h: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h@38 PS5, Line 38: bool overwrite = false > I don't think this is being used anywhere anymore. Done http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h@82 PS5, Line 82: RETRYING, RETRIED, UNKNOWN > I think this approach, and the way you have to save the previous state and Yeah, thats a good suggestion. Haven't looked into adding the ClientRequestState wrapper yet, but for now just split out the the RETRYING and RETRIED states into a `enum RetryState`. It actually simplifies the state management considerably, so thanks for the suggestion. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@632 PS5, Line 632: shared_ptr request_state = GetClientFacingRequestState(query_id); > Unless I'm missing something in this patch, this means that only the profil hmm not sure why I did that, changed back to GetClientRequestState http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@920 PS5, Line 920: unique_ptr exec_request = make_unique(); > So I'm wondering if this patch would benefit from adding another layer of a I think that makes sense. IMO the ImpalaServer <--> ClientRequestState <--> Coordinator code needs some re-factoring in general. I wanted to some general re-factoring all this code (ImpalaServer, ClientRequestState, Coordinator) so I filed IMPALA-9370 as a follow up, but I think it makes sense to move some of the retry logic into its own class in this patch. I re-factored the BlockOnWait code so that it isn't duplicated across the hs2 and beeswax server. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025 PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()()); > As mentioned before, I definitely think it would be awesome if we could cre I sort of did this in the previous patch update. I added a client_query_id_mapping_ that allows an optional mapping between query ids. It doesn't exactly create an "internal" id, but its cleaner that what I was doing before. I think I'm going to resurrect the design doc for this feature and write up some notes on how best to approach this. We can finalize the design there. The mixing of the hi/lo of the TUniqueId is an interesting idea. I'll be sure to include that. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 5 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Mon, 10 Feb 2020 16:50:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Hello Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14824 to look at the new patch set (#6). Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes. Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the leaving node, queries are retried), and (2) if a query fails and as a result, blacklists a node. Both events are considered cluster membership changes as they affect what nodes a query will be scheduled on. Implementation: * Query retries are driven by a dedicated threadpool * ImpalaServer::RetryQueryFromThreadPool implements the core logic to actually retry a failed query. * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A query cannot be retried once any results from the original query have been fetched, this is to prevent users from seeing incorrect results Features: * Retries are transparent to the user * This is achieved by adding a mapping from failed query ids to the query id of the retried query * ImpalaServer uses this mapping in GetClientFacingRequestState which is used to differentiate between "client facing" requests for a ClientRequestState vs. internal requrets for a CRS * Users can tell if a query is retried using runtime profiles and the Impala Web UI * "Impala Query Status" is a new field in runtime profiles that displays the ClientRequestState execution state (which includes the RETRYING and RETRIED states) * The Impala Web UI will list all retried queries as being in the "RETRIED" state * Retried queries skip all fe/ planning, authorization, etc. * This feature is configurable ('retry_failed_queries') and is off by default Refactoring: * Changes the ClientRequestState so that it can take in an existing TExecRequest * This is required when retrying queries because the TExecRequest of the failed query is copied and used for the ClientRequestState of the retried query * ClientRequestState::ExecState is extended with three new states: RETRYING, RETRIED, and UNKNOWN. Testing: * Added integration tests in test_query_retries.py, these tests consistently pass when run locally * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures TODO: * Additional re-factoring / code cleanup * Lots more documentation Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd --- M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h M be/src/service/query-options.cc M be/src/service/query-options.h A be/src/service/retry-work.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/generate_error_codes.py A tests/custom_cluster/test_query_retries.py 16 files changed, 793 insertions(+), 168 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/6 -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 6 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 5: (7 comments) http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@26 PS5, Line 26: A query cannot be retried once any results from the original query : have been fetched, this is to prevent users from seeing incorrect results Do we have any mechanism for a user to specify that they want their results to always be spooled until the query is complete to ensure that any failed queries that are retriable do in fact get retried? http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@41 PS5, Line 41: The Impala Web UI will list all retried queries as being in the : "RETRIED" state This may be a reasonable approach, but I think we should think about the user experience implications of this. In particular, it means that a retried query will show up as two separate queries on the webui, which is potentially confusing, esp. since the retry will have a different query_id that won't correspond the anything user-facing (eg. won't be the query_id printed out by impala-shell) and as you've got the patch right now, I'm not even sure if there's a way to figure out which retries would correspond to which original queries. We could of course do something like add the retry query id to the profile of the original query, but then people have to click through to the profile to match things up. There's also the problem that once CloseClientRequestState is called at the end of RetryQueryFromThreadPool, the original query will show up in the "Completed Queries" section (and may even disappear from there before the retry query completes if the query_log fills up), which is again probably confusing since from the perspective of the client that query is really still running. Some of this could possibly wait until a follow up patch to work out, as long as we've got a plan. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h File be/src/service/client-request-state-map.h: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h@38 PS5, Line 38: bool overwrite = false I don't think this is being used anywhere anymore. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h@82 PS5, Line 82: RETRYING, RETRIED, UNKNOWN I think this approach, and the way you have to save the previous state and then sometimes use this or sometimes the previous state, is confusing. Its probably more straight forward to just have a separate 'enum Retry State { RETRYING, RETRIED, NOT_RETRIED } >From my other comments, if we add a ClientRequestState wrapper, it could go in >there, or even might not be necessary since the wrapper would already know if >we're retrying based on if there's multiple ClientRequestStates that its >wrapped around. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@632 PS5, Line 632: shared_ptr request_state = GetClientFacingRequestState(query_id); Unless I'm missing something in this patch, this means that only the profile for the retry query will be available through the GetRuntimeProfile() api. Possibly fine to leave for followup work, but I think we definitely need to have some way of returning the profile for the original query too. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@920 PS5, Line 920: unique_ptr exec_request = make_unique(); So I'm wondering if this patch would benefit from adding another layer of abstraction. What I have in mind is something like a wrapper around ClientRequestState that can contain multiple ClientRequestStates (though possibly we would name the wrapper ClientRequestState and rename the current ClientRequestState to something like QueryInstanceState or whatever). Some benefits of doing this: - It would provide a natural place for TExecRequest to live without doing this stuff with unique_ptr and std::move, as well as anything else that's common across all retries. - It would allow us to hide some of the details of retries from impala-server, eg. we wouldn't need the logic around BlockOnWait that you've duplicated between the hs2 server and the beeswax server. http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025 PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()()); As mentioned before, I definitely think it would be awesome if we
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5517/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 5 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 24 Jan 2020 23:18:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14824 ) Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5516/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/14824 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Gerrit-Change-Number: 14824 Gerrit-PatchSet: 4 Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 24 Jan 2020 23:16:59 + Gerrit-HasComments: No