subject:"\[Impala\-ASF\-CR\] IMPALA\-9199\: Add support for single query retries on cluster membership changes"

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 32: Code-Review+2

Carrying +2.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 32
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 20:10:59 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Sahil Takiar (Code Review)

Sahil Takiar has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
  QueryHandle

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Reviewed-on: http://gerrit.cloudera.org:8080/14824
Tested-by: Impala Public Jenkins 
Reviewed-by: Sahil Takiar 
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
27 files changed, 2,700 insertions(+), 696 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Sahil Takiar: Looks good to me, approved

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 32: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 32
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 20:06:19 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 32:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6082/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 32
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 15:26:22 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 32:

(1 comment)

Fixing a few issues that are causing the test failures:

(1) Removed a DCHECK (see comment below)
(2) Fixed a bug where GetQueryHandle was not passing return_unregistered to 
GetQueryDriver

Ran a bunch of more tests locally, and things look good:
* Re-ran the stress tests for tpcds and tpch
* Looped the new tests overnight and no failures

Did some additional cleanup of code comments, removing unnecessary imports, 
minor-refactoring, etc.

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511
PS26, Line 511: if (!failed_backend_states.empty()) {
> Done
I had to remove the DCHECK because it caused a crash. Turns out that because of 
the IsAborted() check above, its possible for this list to be empty. Tests that 
use the EXEC_SERIALIZE_FRAGMENT_INFO Debug Action trigger the DCHECK.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 32
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 14:47:57 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 32:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5845/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 32
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 14:44:44 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#32).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
  QueryHandle

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
27 files changed, 2,700 insertions(+), 696 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/32
--
To view, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-15 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 31: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5839/


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 31
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 08:27:59 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 31:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6077/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 31
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 04:01:38 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#31).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry
* QueryDrivers and ClientRequestStates are now referenced using a
  QueryHandle

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
27 files changed, 2,700 insertions(+), 696 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/31
--
To view, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 31:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5839/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 31
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 03:09:39 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 30: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5836/


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 30
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 15 May 2020 01:22:42 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 29:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6067/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 29
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 14 May 2020 21:41:06 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 30:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5836/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 30
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 14 May 2020 20:56:52 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 30: Code-Review+2

Re-based, carrying +2. I'm going to run the pre-commit job to make sure it 
passes, but I'm also going run the stress tests again locally to make sure they 
still pass. I'm also going to do one final read-through of the code to make 
sure we didn't miss anything.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 30
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 14 May 2020 20:56:29 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 29:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58
PS27, Line 58:   return *request_state_;
> Sure, I mean there are basically two cases for how this might be used:
I think I see what your saying now. Yeah, your suggestion makes sense since the 
caller can always decide if they want to make a copy.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136: /// single query submission.
> > I think in that case the error still
ahh I see what you are saying now - right, the code path in 
UpdateBackendExecStatus where AuxErrorInfoPB is present, but no error is 
returned from the code path. the 'retryable_status' gets dropped after the call 
to TryQueryRetry. looks like this happens in CancelFromThreadPool and 
HandleFailedExecRpcs as well.
filed IMPALA-9748 to follow up on this


http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc@35
PS28, Line 35: Finalize() must
> Finalize()
Done


http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc@150
PS28, Line 150:   if (!register_status.ok()) {
> You could also move the call to CreateClientRequestState into CreateNewDriv
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177
PS27, Line 1177: in
> In that case, a non-const QueryHandle& or a QueryHandle*. Not a big deal th
Actually, your original suggestion does work. The fact that QueryHandle is a 
const doesn't really matter because "->" returns a non-const pointer to 
ClientRequestState. So I re-factored CloseClientRequestState, 
UpdateExecSummary, and ArchiveQuery to all take in a const QueryHandle&.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 29
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 14 May 2020 20:49:45 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-14 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#29).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
27 files changed, 2,703 insertions(+), 696 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/29
--
To view, visit http://gerrit.cloudera.org:8080/14824
To

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-13 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 28: Code-Review+2

(5 comments)

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58
PS27, Line 58: inline ClientRequestState* o
> isn't the point here that the pointer returned by query_driver() should be
Sure, I mean there are basically two cases for how this might be used:
- Most of the uses of this in the current patch are calls like 
"query_handle.query_driver()->SomeQueryDriverFn()", in which case the 
QueryHandle is going to be in scope for the whole call and making this a 
"const&" saves an atomic inc/dec without any risk.
- If a developer wants to make a local variable like "shared_ptr = 
query_handle.query_driver()",  making this a "const&" gives them the option of 
whether or not to make a copy (i.e. whether the local variable itself is 
declared a const& or not). This introduces a little risk, cause a dev could not 
make a copy in a situation where they actually needed one.

Its different than the case with adding the operator-> since there we were 
returning a bare pointer to a ClientRequestState which doesn't allow the caller 
to get any guarantees on its lifetime even if they want.

Obviously this isn't really super perf critical code, so its not a big deal. 
Feel free to leave as is.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136: /// id, which will be distinct from the query id of the 
originally submitted query that
> I think in that case the error still
 > gets propagated to the the ClientRequestState::query_status_
 > though, which is why the errors show up in the runtime profile.

I think that's true in some cases but not in other cases, but its fine.


http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/runtime/query-driver.cc@35
PS28, Line 35: StartUnregister
Finalize()


http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/28/be/src/service/impala-hs2-server.cc@150
PS28, Line 150:   query_driver->CreateClientRequestState(query_ctx, session, 
_handle);
You could also move the call to CreateClientRequestState into CreateNewDriver 
and then I think it wouldn't need to be public anymore and you wouldn't need to 
define the 'query_driver' variable on the line before this, and then pass the 
'query_handle' into RegisterQuery which is nice for being consistent about 
passing QueryHandles everywhere.

Fine to leave as is though


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/impala-server.cc@1177
PS27, Line 1177: _m
> CloseClientRequestState requires a bunch of non-const access - e.g. it pass
In that case, a non-const QueryHandle& or a QueryHandle*. Not a big deal though



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 28
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 13 May 2020 18:01:22 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-12 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 28:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6045/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 28
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 12 May 2020 23:53:22 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-12 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 27:

(25 comments)

> I know I've asked for a lot, but I think the patch is a lot better now.

Not a problem, thanks for all the thorough feedback. Don't think this patch 
would be nearly as clean without all the comments.

> I still have a lot of concerns about this feature, esp. about our story 
> around observability, but I think those concerns can mostly be addressed in 
> follow up work and most of my remaining comments are more minor things.

Right. I don't think the expectation should be that this feature is complete by 
any means. I'm not sure how feature development has been done historically in 
Impala, but IMO I think features need to be built piece-meal over 
multiple-commits, before they can really be used. I won't rant too long about 
this, but I just think that getting an initial version that users can at least 
play around with gives us a level of feedback and testing that can't be done 
within a scoped design review session / single code review.

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:
> I don't think that's true - ClientRequestState::UpdateQueryStatus() is never 
> called from inside Coordinator.

I guess, a '-->' wasn't the correct way to describe this. StartBackendExec() 
returns an error to Exec() which returns an error to 
FinishExecQueryOrDmlRequest() which then calls UpdateQueryStatus().

> Now, MarkAsRetrying() will set an error on ClientRequestState but after that 
> Coordinator::Exec() will still return an error which ClientRequestState will 
> try to set on itself with UpdateQueryStatus() but it'll get discarded because 
> an error was already set.

Right.

> The point is, its messy and I'm concerned it will be error prone.

I agree that it is more complex. I'm not sure I understand why it is 
particularly more error prone.

It's not that I don't want to take your suggestion, I just think at this point 
it is better to address this separately. My hesitation is that doing this 
re-factoring now is going to introduce bugs into the patch unnecessarily. From 
experience when first iterating on this patch, doing seemingly harmless 
re-factoring has introduced a lot of bugs. So I'm not really convinced it is 
worth the effort at this point in time.

Taking a step back, I think at this point in the feature lifecyle, its not 
worth investing a huge amount of time on this. The feature itself it still 
immature, and subject to change. It's hard to predict exactly what it will look 
like after all the planned enhancement, so a lot of this debate might end up 
being a moot point.

I am taking note of this feedback though, and plan to track it in IMPALA-9370.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910
PS27, Line 910:   Status status = Status::OK();
> I don't think declaring this here is necessary since its not used outside o
Done


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58
PS27, Line 58: std::shared_ptr
> I think you might want to make this a '&' to avoid unnecessary copies
isn't the point here that the pointer returned by query_driver() should be able 
to out-live the QueryHandle instance? in which case you would need to take a 
shared_ptr copy to ensure proper ref counting?

at least, I thought that was part of the argument for creating the whole "->" 
and "*" operator overloading.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93
PS27, Line 93: TryRetryQuery
> TryQueryRetry
Done


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136: ///
> Right, but MarkAsRetrying doesn't get called in the case where you call 
> AddDetail() because TryQueryRetry returns immediately after AddDetail()

Whoops, yeah you are right. I think in that case the error still gets 
propagated to the the ClientRequestState::query_status_ though, which is why 
the errors show up in the runtime profile.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228
PS27, Line 228:   // Run the new query.
> I think its fine to leave for now, but just wanted to note that the duplica

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-12 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#28).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
27 files changed, 2,692 insertions(+), 682 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/28
--
To view, visit http://gerrit.cloudera.org:8080/14824
To

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-12 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 27:

(25 comments)

Thanks for the changes. I know I've asked for a lot, but I think the patch is a 
lot better now.

I still have a lot of concerns about this feature, esp. about our story around 
observability, but I think those concerns can mostly be addressed in follow up 
work and most of my remaining comments are more minor things.

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:
> Before this patch, if a query fails: 
> ClientRequestState::FinishExecQueryOrDmlRequest()
 > --> Coordinator::Exec() --> Coordinator::StartBackendExec() -->
 > ClientRequestState::UpdateQueryStatus() which sets
 > ClientRequestState::query_status_.

I don't think that's true - ClientRequestState::UpdateQueryStatus() is never 
called from inside Coordinator. If an Exec() rpc fails, before this patch 
Coordinator::Exec() will return an error status and ClientRequestState will 
call UpdateQueryStatus() on itself to set the error that's returned from Exec().

Now, MarkAsRetrying() will set an error on ClientRequestState but after that 
Coordinator::Exec() will still return an error which ClientRequestState will 
try to set on itself with UpdateQueryStatus() but it'll get discarded because 
an error was already set.

As far as I can tell, prior to this patch if the Coordinator needed to 
communicate an error to the ClientRequestState it always did so by returning an 
error Status, eg. if a backend reports an error usually that gets bubbled up 
when the ClientRequestState calls GetNext(), and now we've added this other 
patch where the ClientRequestState calls into the Coordinator and during that 
call the Coordinator calls back into the ClientRequestState to set the error 
even though its also about to return another Status that represents the same 
error with a slightly different text. The point is, its messy and I'm concerned 
it will be error prone.

 > A lot of the re-factoring we end up
 > doing now might just be wasted work if we end up changing half the
 > logic.

Sure, you don't have to take my suggestions, esp. if you have a good reason/a 
vision of your own for how to clean this all up.

I'm trying to help make sure this patch is well designed because I think it'll 
make the further work easier, eg. minimizing the amount of retry related logic 
that ends up in Coordinator should make it easier to make further changes to 
how retries work.

If you're really determined to keep the calls to TryQueryRetry in Coordinator 
in this patch, its fine with me.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/coordinator.cc@910
PS27, Line 910:   Status status = Status::OK();
I don't think declaring this here is necessary since its not used outside of 
the if/else immediately below


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@58
PS27, Line 58: std::shared_ptr
I think you might want to make this a '&' to avoid unnecessary copies


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.h@93
PS27, Line 93: TryRetryQuery
TryQueryRetry


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136: ///
> TryQueryRetry passes the status to
 > MarkAsRetrying which sets the given status as the query_status_ of
 > the ClientRequestState, which is what gets exposed in the runtime
 > profile.

Right, but MarkAsRetrying doesn't get called in the case where you call 
AddDetail() because TryQueryRetry returns immediately after AddDetail()


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/runtime/query-driver.cc@228
PS27, Line 228:   // Run the new query.
I think its fine to leave for now, but just wanted to note that the duplication 
of the query execution logic between here and ImpalaServer is unfortunate and 
it might be nice to have more of a QueryDriver::RunQuery() type function that 
works for both initial attempts and retries when we're doing refactoring.


http://gerrit.cloudera.org:8080/#/c/14824/27/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-12 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 27:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6041/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 27
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 12 May 2020 17:52:33 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-12 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#27).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
27 files changed, 2,634 insertions(+), 675 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/27
--
To view, visit http://gerrit.cloudera.org:8080/14824
To

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-11 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 26:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@511
PS26, Line 511: if (!failed_backend_states.empty()) {
Seems like it shouldn't be possible for this to ever be false. Maybe add a 
DCHECK?


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@512
PS26, Line 512: RETURN_IF_ERROR
I don't think its possible for HandleFailedExecRpcs() to return an error, and 
even if it did it probably wouldn't be correct to just return like this here 
since we still need to call UpdateExecState() below to update the query to an 
error status.


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@982
PS26, Line 982:   
parent_query_driver_->TryQueryRetry(parent_request_state_, _status);
Probably fine to leave as is for now since currently any query that hits this 
is going to end up failing, but just wanted to point out that this code path 
means that we could potentially cancel the query without having actually 
received an error status back yet.

As noted elsewhere, this makes it more difficult to use the AddDetail() on the 
Status out parameter in TryQueryRetry.


http://gerrit.cloudera.org:8080/#/c/14824/26/be/src/runtime/coordinator.cc@1079
PS26, Line 1079:   parent_query_driver_->TryQueryRetry(parent_request_state_, 
_status);
Wanted to point out that there's an unfortunate situation here and elsewhere 
TryQueryRetry is called where we end up with two slightly different versions 
for the overall query status - the one that gets set on Coordinator (in this 
case, exec_rpc_status, which represents the error from the first backend to 
fail in Exec()) and 'retryable_status' which gets set on ClientRequestState in 
MarkAsRetrying().

Normally, those two statuses would be the same, because ClientRequestState 
would get Coordinator's state returned to it from Exec(), which it would then 
set on itself, it won't set it anymore as we only preserve the first error 
message. I think that's messy and has the potential to be confusing for 
developers.

Personally, I still really think it would be worth it to do the work to at 
least somewhat remove the circular dependency here, eg. you could move the call 
to TryQueryRetry to ClientRequestState::UpdateQueryStatus() along with defining 
a Coordinator::BlacklistedExecutors() or something like that, It wouldn't be 
much work, it would solve this problem, and it would make the control flow a 
lot cleaner, make Coordinator better encapsulated since it wouldn't need to 
know anything about retries, etc.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136:   /// query option 'retry_failed_queries') and if the query can 
be retried. Queries can
> Doesn't return a Status anymore.
Sure, 'status' is used in one code path that calls TryQueryRetry (the code path 
in Coordinator where we get an error in a status report, which 
test_multiple_retries is testing), but its not used in any of the other code 
paths that call TryQueryRetry so it gets lost in those cases (eg. 
HandleFailedExecRpcs)

One option would be to just fix those code paths to actually set the status 
they passed in as the out parameter to the overall error status, but that's not 
always going to be easy (eg. the path in Coordinator where we get an 
AuxErrorInfo but no overall query error status in the report, so we have no 
error status to add the details to).

Instead, maybe it would be better to just add an info string to the profile 
that says "This query was not retried because..."

One disadvantage of that is that it means you can't immediately tell from the 
overall error why the query wasn't retried and have to dig down into the 
profile/logs, but that's already the case since you're only adding the detail 
if we hit the max retries and not for other cases such as if rows had already 
been fetched, though maybe you want to do an AddDetail for those cases too.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204
PS24, Line 204:   /// 'RunFrontendPlanner'.
> Do you mean that CreateRetriedClientRequestState could just pass in a point
Personally, I think the code is way more confusing this way.

You're not "moving" the TExecRequest from the original ClientRequestState to 
the new one - the TExecRequest is just owned by the QueryDriver, so you're 
moving it from one place in the QueryDriver to a different place in the 
QueryDriver.

It doesn't change at any point,

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-08 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 26:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6012/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 26
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 08 May 2020 20:07:10 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-08 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 25:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6011/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 25
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 08 May 2020 20:04:18 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-08 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 25:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc@937
PS24, Line 937: parent_query_dri
> This isn't valid. (and you have similar issues elsewhere)
Done


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@1
PS24, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
> The distinction isn't that clear, but it might make more sense to put this
I think client-request-state probably belongs in the runtime folder. I plan to 
move some of this around in IMPALA-9370. Left a comment to follow up on this in 
IMPALA-9370.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@39
PS24, Line 39: QueryHandle
> I think this name is confusing, esp. since there's already a QueryHandle in
I like referring to this as a handle, since its really just a pointer into 
other query level objects.
I think some of these re-factoring / re-renmaing decisions should be done in 
IMPALA-9370 so that we can tackle them all together and ensure that all class 
names are consistent with one another.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@82
PS24, Line 82: /// *Transparent* Query Retries:
> Not a huge deal in this patch since it's still hidden behind a flag and exp
That is a good point. I added some notes here and in the uber-JIRA as a 
disclaimer. It's a bit hard to change all the places where this is referred to 
as "transparent query retries" and I think many folks are already using the 
phrase. So I think at this point it is a matter of setting proper expectations. 
I didn't explicitly mention the GetRuntimeProfile() changes so we are planning 
to revise the behavior in IMPALA-9229
Yeah - I've tested this with Hue, and it works as expected. There are a few 
observability improvements that could be made to Hue to improve this. The 
uber-JIRA (IMPALA-9124) has some sub-tasks related to impala-shell usability 
improvements.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136:   /// query option 'retry_failed_queries') and if the query can 
be retried. Queries can
> The way you're both returning a Status and also returning some status infor
Doesn't return a Status anymore.
The input Status is used to set the error message "Max retry limit was hit" 
which does show up in the final error message (see 
TestQueryRetries::test_multiple_retries)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@189
PS24, Line 189:  owns this QueryDriver.
> ?
Forgot to remove this. Deleted.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@194
PS24, Line 194:
  :   /// The ClientRequestState
> ?
Forgot to remove this. Deleted.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204
PS24, Line 204:   /// 'RunFrontendPlanner'.
> Now that this is just move()-ed from 'exec_request_' I'm not sure why its s
Do you mean that CreateRetriedClientRequestState could just pass in a pointer 
to exec_request_ rather than retry_exec_request_ when creating the new 
ClientRequestState? I guess we could do that, although I think its a little 
easier to understand the code in its current form. you move the TExecRequest 
out from the original ClientRequestState into the new one.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@97
PS24, Line 97: if (client_request_state->fetched_rows()) {
> This function would be more readable if you turned these into:
Done


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@302
PS24, Line 302: try_request
> Instead of this make_unique/move() thing, why not just have SetOriginalId()
Done


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/service/impala-hs2-server.cc@884
PS24, Line 884:   QueryHandle query_handle;
> I'm not sure that in its current state QueryHandle is really doing much. Th
I think the point of the QueryHandle is to enforce that whenever users use a 
ClientRequestState, that the shared_ptr to the corresponding QueryDriver is in 
scope. QueryHandle might not be the perfect way to do this, but I think its 
better than the previous solution where you had to make sure the 
shared_ptr

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-08 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#26).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,482 insertions(+), 513 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/26
--
To view, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-08 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 25:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@642
PS25, Line 642: Status status = GetQueryHandle(query_id, _handle, 
/*return_unregistered=*/ true);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@644
PS25, Line 644:   ClientRequestState* request_state = 
query_handle.request_state;
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@710
PS25, Line 710: Status status = GetQueryHandle(query_id, _handle, 
/*return_unregistered=*/ true);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/14824/25/be/src/service/impala-server.cc@712
PS25, Line 712:   ClientRequestState* request_state = 
query_handle.request_state;
line has trailing whitespace



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 25
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 08 May 2020 19:14:33 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-08 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#25).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,480 insertions(+), 513 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/25
--
To view, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-07 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 24:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/coordinator.cc@937
PS24, Line 937: RETURN_IF_ERROR(
This isn't valid. (and you have similar issues elsewhere)

As far as I can tell, the only way that this returns is if we decide to retry 
the query and then Thread::Create returns an error, but even in that case we 
need to still finish this function and eg. call UpdateExecState below with the 
original error status.

In general, errors related to failing to retry aren't ever going to be overall 
query errors, and we'll always need to finish whatever processing we were doing 
for the original query, so probably all we need to do is log any errors from 
TryQueryRetry and move on. It may even be best to log the errors from within 
TryQueryRetry and not have it even return anything.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@1
PS24, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
The distinction isn't that clear, but it might make more sense to put this in 
/be/src/service, since right now its:

impala-server (src/service) -> query-driver (src/runtime) -> 
client-request-state (src/service) -> coordinator (src/runtime)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@39
PS24, Line 39: QueryHandle
I think this name is confusing, esp. since there's already a QueryHandle in 
beeswax. Its really more of a ClientRequestStateHandle, though that's wordy. 
Maybe CRSHandle? ClientRequestHandle?

It might even be worth renaming ClientRequestState, since if anything the 
QueryDriver is what really holds the current state of the client request. Maybe 
QueryInstance? Not a huge deal though


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@82
PS24, Line 82: /// *Transparent* Query Retries:
Not a huge deal in this patch since it's still hidden behind a flag and 
experimental, but I think that we should be cautious about saying that this is 
completely transparent, since it does change what clients see, potentially in 
ways that could break existing clients (eg. if you call GetRuntimeProfile() and 
then do an assert that the returned profile has the same query_id as what you 
requested, turning this on could cause that assert to fail).

Obviously at a minimum we need to test this with all the usual known clients 
(impala-shell, impyla, jdbc/odbc, Hue, etc.), and we'll probably be making 
changes to at least some of those clients for the sake of observability around 
this (eg. impala-shell might want to print a message about the query getting 
retried)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@136
PS24, Line 136:   Status TryQueryRetry(ClientRequestState* 
client_request_state, Status* status,
The way you're both returning a Status and also returning some status 
information in a Status out parameter seems confusing, and in fact it looks 
like at most of the call sites of this function the Status out parameter is 
never actually referenced again after the call so any info added to it is 
usually dropped.

I think its probably better to just not have 'status' be an out parameter and 
return use the returned status for everything, or possibly not return any 
status info at all (see my other comments)


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@189
PS24, Line 189: A shared_ptr is used to allow asynchronous deletion
?


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@194
PS24, Line 194: A shared_ptr is used to allow
  :   /// asynchronous deletion.
?


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.h@204
PS24, Line 204:   std::unique_ptr retry_exec_request_;
Now that this is just move()-ed from 'exec_request_' I'm not sure why its still 
needed.


http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/24/be/src/runtime/query-driver.cc@97
PS24, Line 97: if (!client_request_state->fetched_rows()) {
This function would be more readable if you turned these into:

if (reason_not_to_retry) {
LOG << reason;
return;
}
if (another_reason) {
...

(after doing that, I personally would get rid of RetryAsync() as 1) its only 
called in one place 2) a lot of its code is just DCHECK-ing things that the ifs 
here enforce, so that could just be removed and 3) the names

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-07 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 24:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5990/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 24
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 07 May 2020 15:50:54 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-07 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#24).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC-{H,DS} workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,458 insertions(+), 510 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/24
--
To view, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-06 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 23:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5983/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 23
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 07 May 2020 01:06:30 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-06 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 23:

Fixed compilation issue. Rebased against and resolved some conflicts on top of 
IMPALA-9692.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 23
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 07 May 2020 00:54:35 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-06 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#22).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,458 insertions(+), 510 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/22
--
To view, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-06 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 21: Code-Review+1

(2 comments)

This is looking good to me.

http://gerrit.cloudera.org:8080/#/c/14824/21/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/21/be/src/runtime/coordinator.cc@1071
PS21, Line 1071:   transform(failed_backend_states.begin(), 
failed_backend_states.end(),
   :   back_inserter(backend_addresses), [](BackendState* 
backend_state) -> string {
   : return 
TNetworkAddressToString(backend_state->krpc_impalad_address());
   :   });
Purely style nit: I think this would be cleaner as a for-loop.

for (BackendState* backend_state : failed_backend_states) {
  backend_addresses.push_back(TNetworkAddressToString(...));
}


http://gerrit.cloudera.org:8080/#/c/14824/21/be/src/runtime/coordinator.cc@1077
PS21, Line 1077:   for_each(failed_backend_states.begin(), 
failed_backend_states.end(),
   :   [_status](BackendState* backend_state) {
   : retryable_status.MergeStatus(
   : FromKuduStatus(backend_state->exec_rpc_status(), 
"Exec() rpc failed"));
   :   });
Purely style nit: I think this would be cleaner as a for-loop:

for (BackendState* backend_state : failed_backend_states) {
  retryable_status.MergeStatus(FromKuduStatus(...), ...);
}



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 21
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 23:14:56 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5973/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 21
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 03:56:38 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 19:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5972/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 19
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 03:53:21 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 19:

Rebased on top of IMPALA-9380, which had a bunch of merge conflicts that I had 
to resolve. Re-ran the tests from IMPALA-9380 and they pass.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 19
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 03:13:06 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#21).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,445 insertions(+), 510 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/21
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 19:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1153
PS19, Line 1153:
> line has trailing whitespace
Done


http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1310
PS19, Line 1310:   
RETURN_IF_ERROR(query_handle.request_state->Cancel(/*check_inflight=*/ true, 
/*cause=*/ nullptr));
> line too long (100 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 19
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 03:12:35 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 19:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1153
PS19, Line 1153:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/14824/19/be/src/service/impala-server.cc@1310
PS19, Line 1310:   
RETURN_IF_ERROR(query_handle.request_state->Cancel(/*check_inflight=*/ true, 
/*cause=*/ nullptr));
line too long (100 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 19
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 03:10:49 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#19).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,444 insertions(+), 510 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/19
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5971/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 18
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 02:10:43 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 17:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5970/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 17
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 02:09:40 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#18).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,372 insertions(+), 515 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/18
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h@249
PS17, Line 249:   virtual void Cancel(impala::TStatus& status, const 
beeswax::QueryHandle& beeswax_handle);
> line too long (91 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 17
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 01:25:59 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/17/be/src/service/impala-server.h@249
PS17, Line 249:   virtual void Cancel(impala::TStatus& status, const 
beeswax::QueryHandle& beeswax_handle);
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 17
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 01:24:24 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#17).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,370 insertions(+), 514 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/17
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 16:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h@248
PS16, Line 248: A shared_ptr is used because the
  :   /// Coordinator can outlive its associated QueryDriver.
> This comment needs updating
Done


http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h@607
PS16, Line 607: A shared_ptr is used because the
  :   /// ClientRequestState can outlive its associated QueryDriver.
> Update this comment (or remove)
Done


http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc@212
PS16, Line 212: query_handle;
> Nit: It would be nice to standardize the naming for variables/arguments for
Done


http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h@933
PS16, Line 933:   void TUniqueIdToQueryHandle(const TUniqueId& query_id, 
beeswax::QueryHandle* handle);
  :   void QueryHandleToTUniqueId(const beeswax::QueryHandle& 
handle, TUniqueId* query_id);
> Nit: Should we rename this to emphasize that this is a beeswax::QueryHandle
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 16
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 06 May 2020 01:22:46 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-05 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 16:

(4 comments)

Took another pass. It is mostly nits about beeswax::QueryHandle vs 
impala::QueryHandle and some stale comments.

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/runtime/coordinator.h@248
PS16, Line 248: A shared_ptr is used because the
  :   /// Coordinator can outlive its associated QueryDriver.
This comment needs updating


http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/client-request-state.h@607
PS16, Line 607: A shared_ptr is used because the
  :   /// ClientRequestState can outlive its associated QueryDriver.
Update this comment (or remove)


http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-beeswax-server.cc@212
PS16, Line 212: query_handle;
Nit: It would be nice to standardize the naming for variables/arguments for 
beeswax::QueryHandle vs impala::QueryHandle across the functions in this file. 
e.g. the beeswax::QueryHandle is always 'handle' or 'beeswax_handle' or 
whatever and the impala::QueryHandle is always 'query_handle' or 
'impala_query_handle' or whatever. I think I would lean toward 
beeswax::QueryHandle -> 'beeswax_handle' and impala::QueryHandle -> 
'query_handle', but I'm flexible.

This variability predates your change, but I don't think there is anything 
stopping us from standardizing.


http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/16/be/src/service/impala-server.h@933
PS16, Line 933:   void TUniqueIdToQueryHandle(const TUniqueId& query_id, 
beeswax::QueryHandle* handle);
  :   void QueryHandleToTUniqueId(const beeswax::QueryHandle& 
handle, TUniqueId* query_id);
Nit: Should we rename this to emphasize that this is a beeswax::QueryHandle? 
Something like:
TUniqueIdToBeeswaxHandle()
BeeswaxHandleToTUniqueId()



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 16
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 05 May 2020 23:38:53 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-03 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 16:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5941/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 16
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 04 May 2020 01:52:24 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-03 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169
PS13, Line 169: _state_;
> I should have mentioned that the shared_ptr approach also requires special
i decided just to create a struct called QueryHandle that contains a shared_ptr 
to the QueryDriver and a naked pointer to the ClientRequestState. I reviewed 
all the places where a QueryDriver / ClientRequestState are used and make sure 
the ClientRequestState is only used when the QueryDriver is still in scope. I 
used new QueryDriver struct where I thought it was useful (e.g. 
ImpalaServer::GetQueryHandle).
i added some docs that explain the ClientRequestState should only be used when 
a valid QueryDriver is in scope.
since the QueryDriver must be in scope while the ClientRequestState is being 
used, I changed the ownership of the ClientRequestState back to unique_ptr and 
changed the Get*ClientRequestState() methods to return a naked pointer.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763
PS9, Line 763:   ClientRequestState* request_state = nullptr;
> yeah, thats a good point, thanks for catching that. changed it to use a sha
i ended it up changing it back to unique_ptr. I changed the semantics a bit so 
that whenever you use a ClientRequestState, the corresponding QueryDriver 
shared_ptr must have an active reference on the stack. I added a new struct 
called QueryHandle that contains a shared_ptr to a QueryDriver and a naked 
pointer to a ClientRequestState. the idea is that the QueryHandle will help 
insure a QueryDriver reference is alive when accessing a ClientRequestState.
there are a few places where it didn't make sense to use the new QueryHandle, 
so I just spot checked that whenever a ClientRequestState is used, the 
corresponding QueryDriver is still in scope.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 04 May 2020 01:06:31 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-03 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#16).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,328 insertions(+), 473 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/16
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169
PS13, Line 169:
> Fixed. The Coordinator and ClientRequestState now take a shared_ptr to the
I should have mentioned that the shared_ptr approach also requires special 
teardown. We need to break the cycle of shared_ptrs, so there isn't a resource 
leak.

If A holds a shared_ptr to B and B holds a shared_ptr to A, they will both 
continue to exist even if the last external shared_ptr is destructed.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 22:52:27 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5928/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 15
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 22:17:58 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5927/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 22:16:10 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc@163
PS14, Line 163:   ::RetryQueryFromThread, this, error, 
query_driver, _query_thread_));
> line too long (92 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 21:34:45 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#15).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,301 insertions(+), 427 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/15
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 15
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer:

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 13:

(5 comments)

addressed comments, fixed a possible race condition when a query gets cancelled 
and unregistered while a retry is being setup.

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169
PS13, Line 169:
> ClientRequestState has a pointer back to the QueryDriver (as does the Coord
Fixed. The Coordinator and ClientRequestState now take a shared_ptr to the 
QueryDriver


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@161
PS13, Line 161: void QueryDriver::RetryQueryFromThread(const Status& error) {
> Is there anything that prevents the QueryDriver from being deleted while th
hmm yeah that's true. I changed it so that RetryQueryFromThread takes in a 
shared_ptr to the current QueryDriver. its a bit odd that the method has to 
take in a reference to itself, but I couldn't see a better way.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@178
PS13, Line 178:   // Cancel the query.
> Is this necessary? At least in the case of Coordinator calling TryQueryRetr
I don't think 'ImpalaServer::CancelFromThreadPool' (which calls TryQueryRetry() 
will call Cancel. Its probably possible to remove this, but I thought it was a 
nice property that you want to ensure the query is cancelled before trying to 
retry it.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@179
PS13, Line 179: true, nullptr
> Both of these parameters seem wrong - you note above that a query could get
I changed the 'check_inflight' parameter to false.
ClientRequestState::MarkAsRetrying explicitly sets the query_status_ in the 
ClientRequestState. The issue with calling Cancel with a cause is that it will 
call UpdateQueryStatus, which sets the ExecState to ERROR, which you don't want 
because the ERROR state will get exposed to the client, which you don't want to 
happen since the query is being retried.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@281
PS13, Line 281:   // TODO: IMPALA-9502: Avoid copying TExecRequest when 
retrying queries
> Out of curiosity, what are the issues here?
hmm can't remember, I replaced this with 'move(exec_request_)' and all the 
tests and a run of the stress test passes. I added this a long time ago on a 
much older version of this feature, so it is possible it got fixed as some 
point inadvertently.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 21:33:54 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/14/be/src/runtime/query-driver.cc@163
PS14, Line 163:   ::RetryQueryFromThread, this, error, 
query_driver, _query_thread_));
line too long (92 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 21:32:37 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#14).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* There are several limitations that are listed out in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,300 insertions(+), 427 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/14
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer:

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-05-01 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.h@169
PS13, Line 169:
ClientRequestState has a pointer back to the QueryDriver (as does the 
Coordinator), so if ClientRequestState outlives the QueryDriver, then that 
pointer would be dangling.

I think the simplest lifetime semantic is for the QueryDriver to live as long 
as any child ClientRequestStates is alive.

I think we are pretty close to that being true. Most code looks up a 
QueryDriver and then looks up the ClientRequestState with the QueryDriver 
shared_ptr staying on the stack. One option is to formalize it and have a 
struct that contains a shared_ptr to the QueryDriver and then a pointer to 
ClientRequestState. If everything uses that struct when operating on a 
ClientRequestState, then the QueryDriver will always outlive its children 
ClientRequestStates. This is just a sketch, so I might be missing something.

An implicit way to do this is to have a shared_ptr from ClientRequestState to 
the QueryDriver.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 01 May 2020 17:48:05 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-30 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 13:

(4 comments)

Still going through it

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@161
PS13, Line 161: void QueryDriver::RetryQueryFromThread(const Status& error) {
Is there anything that prevents the QueryDriver from being deleted while this 
thread is running? eg. what if the query fails and then while the retry is 
happening the user cancels the query? Mostly we're relying on the shared_ptr to 
prevent those sorts of issues, but I don't think we're holding a shared_ptr for 
the duration of this, since you have to just pass a bare 'this' to 
Thread::Create


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@178
PS13, Line 178:   // Cancel the query.
Is this necessary? At least in the case of Coordinator calling TryQueryRetry() 
we'll end up calling cancel from the Coordinator too when the status is updated 
to an error. I guess its fine because Cancel() is idempotent, but might be 
worth thinking about what we can do to make this cleaner.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@179
PS13, Line 179: true, nullptr
Both of these parameters seem wrong - you note above that a query could get 
retried when its still in the INITIALIZED state, so seems like 
'check_inflight=true' could fail, and according to the comment about 
ClientRequestState::Cancel 'nullptr' for cause is supposed to indicate a 
user-initiated cancel that doesn't have an associated error, which this isn't.

I think in the case of an error that comes from the query itself (i.e. 
Coordinator call TryQueryRetry()) you'll get the error set anyways when 
UpdateExecState() is called (unless the AuxErrorInfo comes in a different rpc 
than the overall error message, in which case I think there's a race between 
the coordinator getting cancelled and no longer accepting reports, though I'm 
not sure), and in the case of a backend crashing (i.e. CancelFromThreadPool 
calls TryQueryRetry()), I don't think the error will get set.


http://gerrit.cloudera.org:8080/#/c/14824/13/be/src/runtime/query-driver.cc@281
PS13, Line 281:   // TODO: IMPALA-9502: Avoid copying TExecRequest when 
retrying queries
Out of curiosity, what are the issues here?



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 23:04:08 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-30 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#13).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* IMPALA-9225: A query cannot be retried once any results from the original
  query have been fetched
* IMPALA-9200: A query can currently only be retried once

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,267 insertions(+), 427 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/13
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-30 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 13:

(2 comments)

addressed remaining comments

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@502
PS9, Line 502:   && !backend_state->exec_rpc_status().IsAborted())
> For this loop, do we have any expectation about the number of BackendStates
yeah, good point. fixed.


http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py@93
PS9, Line 93: Increase the statestore
: heartbeat frequency so that the query actually fails during 
execution.
> It looks like we are using the default value for statestore_heartbeat_frequ
yeah - the tests infra actually sets a lower value for this: 
https://github.com/apache/impala/blob/master/tests/common/custom_cluster_test_suite.py#L59

yes, your understanding is correct. a RPC should fail which should cause the 
query to be retried.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 12:53:15 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5916/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 00:50:28 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5915/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 00:48:34 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5914/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 00:47:53 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc@70
PS11, Line 70: shared_ptr 
QueryDriver::GetClientRequestState(const TUniqueId& query_id) {
> line too long (94 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 00:06:15 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#12).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* IMPALA-9225: A query cannot be retried once any results from the original
  query have been fetched
* IMPALA-9200: A query can currently only be retried once

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,259 insertions(+), 433 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/12
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 11:

(10 comments)

Patch set 10 is a rebase. Patch set 11 address the following comments. 
Addressed / responded to the rest of the comments from Thomas. Addressed most 
(but not all) comments from Joe.

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@113
PS9, Line 113: (2) the
 :   /// query has not already been ret
> not sure what this is supposed to mean
revised


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@116
PS9, Line 116: d if
 :   /// there has been a c
> I think this should be singular "a statestore update"
Done


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@122
PS9, Line 122:   /// is set to true if the query was actually retried, false 
otherwise. This method is
 :   /// idempotent, it can safely be called multiple tim
> From reading through TryQueryRetry(), it looks like this is safe to call mu
updated docs to mention that method is idempotent


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h@84
PS9, Line 84: ;
> I don't think this is used anywhere?
Done


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@148
PS9, Line 148:   shared_ptr query_driver;
> I don't think this is used anywhere?
Done


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763
PS9, Line 763:   HS2_RETURN_IF_ERROR(
> I had the same concern.
yeah, thats a good point, thanks for catching that. changed it to use a 
shared_ptr


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@1039
PS9, Line 1039:   // The original query profile should still be accessible via 
the web ui. Don't use
> I think its definitely necessary that we provide a way for all relevant pro
agree this is an issue. the reason i went with the current approach is so that 
all methods in the hs2/beeswax interface follow the same "transparent" pattern.
i think it might be slightly more intuitive to return the profile of the 
successful query rather than the originally failed one. although I think 
returning the original, failed profile could make sense as well. I don't know 
if one approach is necessarily that much better than the other.
i agree that there should be a way to fetch the failed profiles via the hs2 
interface, I'm planning to tackle that in IMPALA-9229


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h@673
PS9, Line 673: std::shared_ptr Stylistically, I personally don't like mutable references for out arguments
Done since I had to change this all the shared_ptr, so now it is 
"shared_ptr*"


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc@1453
PS9, Line 1453:   // ClientRequestState::output_exprs_, which
> Nit: you could move this declaration down past the first BlockOnWait()
Done


http://gerrit.cloudera.org:8080/#/c/14824/9/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/14824/9/common/thrift/ImpalaService.thrift@523
PS9, Line 523:
> typo
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 00:04:51 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/14824/11/be/src/runtime/query-driver.cc@70
PS11, Line 70: shared_ptr 
QueryDriver::GetClientRequestState(const TUniqueId& query_id) {
line too long (94 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 30 Apr 2020 00:04:39 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#11).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* IMPALA-9225: A query cannot be retried once any results from the original
  query have been fetched
* IMPALA-9200: A query can currently only be retried once

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,258 insertions(+), 433 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/11
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-29 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#10).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* IMPALA-9225: A query cannot be retried once any results from the original
  query have been fetched
* IMPALA-9200: A query can currently only be retried once

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,253 insertions(+), 448 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/10
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-27 Thread Joe McDonnell (Code Review)

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 9:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@502
PS9, Line 502: RETURN_IF_ERROR(HandleFailedExecRpc(backend_state));
For this loop, do we have any expectation about the number of BackendStates 
that have errors? In other words, can more than one have errors?

If there can be more than one error, we would call TryQueryRetry more than 
once, but based on my reading of that function, it can handle that. Am I right?

Is there any concern that if there is more than one error, we may want to 
blacklist more than one Impalad before retrying?


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@116
PS9, Line 116: a
 :   /// statestore updates
I think this should be singular "a statestore update"


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@122
PS9, Line 122:   Status TryQueryRetry(ClientRequestState* client_request_state, 
Status* status,
 :   bool* was_retried = nullptr) WARN_UNUSED_RESULT;
>From reading through TryQueryRetry(), it looks like this is safe to call 
>multiple times and it will only do the retry once. Is that true? Can we add a 
>comment here that spells out that it can be called more than once?


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763
PS9, Line 763:   ClientRequestState* request_state = nullptr;
> So I'm wondering if its safe to no longer have any shared_ptr here and in t
I had the same concern.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.h@673
PS9, Line 673: ClientRequestState*&
Stylistically, I personally don't like mutable references for out arguments. 
Using ** instead makes it much clearer that values are being passed between 
functions. Callers need to pass addresses. The function itself can't treat it 
like a local variable.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-server.cc@1453
PS9, Line 1453:   ClientRequestState::RetryState retry_state;
Nit: you could move this declaration down past the first BlockOnWait()


http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/14824/9/tests/custom_cluster/test_query_retries.py@93
PS9, Line 93: Increase the statestore
: heartbeat frequency so that the query actually fails during 
execution.
It looks like we are using the default value for 
statestore_heartbeat_frequency_ms. Is that intentional?

If I understand this correctly, this test uses the heavy shuffle query and 
should see an RPC failure. We are not expecting the statestore to see the 
Impalad as dead. Is that right?



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 27 Apr 2020 17:48:11 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-09 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 9:

(2 comments)

responded to a few comments from Thomas, still thinking through some of his 
other comments.

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@922
PS9, Line 922:   
parent_query_driver_->TryQueryRetry(parent_request_state_, _status));
yeah, agree it is unfortunate. although i think the issue has always been 
there. the coordinator class already has a parent_request_state_ that points to 
the owning ClientRequestState. the ClientRequestState has a similar pattern, it 
has a parent_server_ which points to the owning ImpalaServer.
i'm not sure this is overall a major problem, it does make the code a bit 
confusing, but I've done my best to limit the usage of the parent_query_driver_ 
class. only the TryQueryRetry method is ever called on the parent_query_driver_.
yeah, it does take the coordinator lock, but only for a pretty short period of 
time - it just checks some state. the actual work to schedule and run the 
retried query is done in a separate thread, that doesn't require holding the 
coordinator lock. acquiring the lock should only be done rarely - only when a 
query fails with a retryable error.

> Would it be possible to instead have the QueryDriver wait on the coordinator 
> to finish and then check its status and decide whether to retry then?

> One problem is the QueryDriver needs to know not just if the query hit an 
> error but if the error was something retryable, but we could do something 
> like have the coordinator remember any nodes it blacklists and expose that 
> info to the QueryDriver.

yeah, I considered this at some point, and it might be a cleaner design, just 
not sure how much re-factoring it is going to require.
i can convince myself the current approach of just calling TryRetryQuery in the 
coordinator is fine as well. I think they can be thought of as "callback" 
functions into the QueryDriver that get triggered under specific conditions.

if you still feel strongly about it, i can do some digging, but would prefer to 
make the code changes in a follow up patch because i don't think it will be a 
straightforward / small change to make, and i don't want to expand the scope of 
this patch further.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@1060
PS9, Line 1060:   
ExecEnv::GetInstance()->cluster_membership_mgr()->BlacklistExecutor(
> This of course doesn't actually guarantee that the retried query won't be s
that's a good point, filed IMPALA-9636 to fix this.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 09 Apr 2020 17:57:32 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-04-07 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 9:

(8 comments)

Still going through this, but some high level thoughts (and a few nit-picks I 
happened to notice already)

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@922
PS9, Line 922:   
parent_query_driver_->TryQueryRetry(parent_request_state_, _status));
So obviously the circular dependency here between QueryDriver, 
ClientRequestState, and Coordinator is unfortunate.

It probably works as is, and it might be difficult to get rid of without 
significant code restructuring, but I'm concerned its confusing/brittle (eg. 
this call results in us taking ClientRequestState::lock_ on the 
ReportExecStatus rpc thread) and so I've been thinking through the options.

Would it be possible to instead have the QueryDriver wait on the coordinator to 
finish and then check its status and decide whether to retry then?

One problem is the QueryDriver needs to know not just if the query hit an error 
but if the error was something retryable, but we could do something like have 
the coordinator remember any nodes it blacklists and expose that info to the 
QueryDriver.

Another issue is that it means we won't start the retry quite as quickly, since 
we have to wait for the QueryDriver to notice the error, but that might be fine 
- its already the case that Coordinator::Wait() will return immediately after 
an error is reported, without waiting for other backends to be cancelled, see 
HandleExecStateTransition()->CancelBackends()->backend_exec_complete_barrier_->NotifyRemaining().
 The worse case is when the blacklisting info and the error status don't arrive 
in the same report, eg. the call to TryQueryRetry below, but that should be 
rare since the error status will have been generated at the same time as the 
AuxErrorInfo, so you just have to get pretty unlucky with the timing of when 
the report is generated.

Maybe the bigger issue is I'm not sure its easy to do with the way the code is 
set up. I was trying to trace through the various 
Wait()/WaitAsync()/BlockOnWait() whatever calls but I'm not sure how that all 
works now.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/coordinator.cc@1060
PS9, Line 1060:   
ExecEnv::GetInstance()->cluster_membership_mgr()->BlacklistExecutor(
This of course doesn't actually guarantee that the retried query won't be 
scheduled on the executor that gets blacklisted, eg. if it takes longer for the 
query to get through admission control again than the blacklist timeout.

We might want to consider doing something like passing through a list of 
executors to reschedule on, to avoid having queries repeatedly fail in the same 
way. On the other hand, the blacklist timeout was designed to be longer than it 
should take the statestore to notice the executor is down, so in theory if the 
executor really is down maybe it doesn't matter.

Probably not necessary to do anything different for this patch, though, just 
wanted to point this out.


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h
File be/src/runtime/query-driver.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/runtime/query-driver.h@113
PS9, Line 113: (2) the
 :   /// query has already been retried
not sure what this is supposed to mean


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/client-request-state.h@84
PS9, Line 84: UNKNOWN
I don't think this is used anywhere?


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@148
PS9, Line 148:   unique_ptr exec_request = 
make_unique();
I don't think this is used anywhere?


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@763
PS9, Line 763:   ClientRequestState* request_state = nullptr;
So I'm wondering if its safe to no longer have any shared_ptr here and in the 
other places in this file - what's to stop the QueryDriver from getting deleted 
by Unregister while this function is executing, which would make this pointer 
no longer valid?


http://gerrit.cloudera.org:8080/#/c/14824/9/be/src/service/impala-hs2-server.cc@1039
PS9, Line 1039:   // If the query was retried, fetch the profile for the most 
recent attempt of the query
I think its definitely necessary that we provide a way for all relevant 
profiles to be accessed through HS2, not just the webserver.

This is one case where the

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 9:

After a lot of cleanup, doc updates, and testing, I think this patch is finally 
ready to go (open to general comments). I made a big change to the commit 
message to describe the patch + all the testing I have done. There are a few 
TODOs sprinkled around the code, but non are essential (most are cleanup 
related).


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:45:21 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5664/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:41:52 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5663/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:39:40 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#9).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* IMPALA-9225: A query cannot be retried once any results from the original
  query have been fetched
* IMPALA-9200: A query can currently only be retried once

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,253 insertions(+), 448 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/9
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py
File tests/stress/query_retries_stress_runner.py:

http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@160
PS8, Line 160: e
> flake8: E722 do not use bare except'
Done


http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@164
PS8, Line 164: e
> flake8: E722 do not use bare except'
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:08:24 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py
File tests/stress/query_retries_stress_runner.py:

http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@160
PS8, Line 160: e
flake8: E722 do not use bare except'


http://gerrit.cloudera.org:8080/#/c/14824/8/tests/stress/query_retries_stress_runner.py@164
PS8, Line 164: e
flake8: E722 do not use bare except'



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:05:52 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-03-31 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#8).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124).

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the removed node, queries are retried), or (2) if a
query fails and as a result, blacklists a node. Either event is
considered a cluster membership change as it affects what nodes a query
will be scheduled on. The assumption is that a retry of the query with
the updated cluster membership will succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modification to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query, and
  the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well, it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
* "Retry Status: RETRIED"
* "Retry Cause: [the error that triggered the retry]"
* "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (e.g. the second attempt of the
  query) will include:
* "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py
* Added a stress test query_retries_stress_runner.py that runs concurrent
  streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested retries work with various clients, specifically the
  impala-shell and Hue

Limitations:
* IMPALA-9225: A query cannot be retried once any results from the original
  query have been fetched
* IMPALA-9200: A query can currently only be retried once

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
D be/src/service/client-request-state-map.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
26 files changed, 2,253 insertions(+), 448 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/8
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-24 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 7:

(1 comment)

> Instead of creating a separate doc, think it might be easier to
 > just post comments here.
 >
 > I've done some re-factoring locally and this is what I have so far:
 >
 > * Created a "QueryDriver", which (as suggested by Thomas) is a
 > layer of abstraction of on top of ClientRequestState
 > * "QueryDriver" owns the ClientRequestState; all access to the
 > ClientRequestState goes through the QueryDriver
 > * Since the first patch will only support a single retry, the
 > QueryDriver has a regular client_request_state_ and a
 > retried_client_request_state_ (one for the original query and the
 > other for the retried query)
 > * QueryDriver owns the TExecRequest that is used by both the
 > original and retried queries
 > * The "transparent" part of this feature is now handled by the
 > QueryDriver (since the QueryDriver owns all access to the
 > ClientRequestState)
 > * The QueryDriver has two methods: GetActiveClientRequestState()
 > and GetClientRequestState(query_id)
 > * GetActiveClientRequestState() returns the CRS for the "active"
 > query - e.g. if the query has not been retried yet, it returns the
 > original query; if the query has been retried, it returns the
 > retried query
 > * GetClientRequestState(query_id) returns the CRS for the query
 > that matches the given 'query_id' - this allows clients to get the
 > CRS of exactly the query_id they are looking for
 > * ImpalaServer creates a QueryDriver for each query; the
 > ImpalaServer::client_request_state_map_ has been replaced with the
 > ImpalaServer::query_driver_map_ which maps query ids to
 > QueryDrivers
 > * The query_driver_map_ can have key, values pairs with the same
 > value (e.g. the same QueryDriver)
 > * Moved all of the retry code (along with the retry threadpool)
 > into QueryDriver (specifically QueryDriver::RetryQueryFromThreadPool)
 >
 > Example:
 >
 > * Client submits a query for execution
 > * Impala creates a QueryDriver for the query
 > * QueryDriver::CreateClientRequestState creates a ClientRequestState
 > for the query (this CRS is owned by the QueryDriver)
 > * ImpalaServer::RegisterQuery "registers" the query and adds an
 > entry to ImpalaServer::query_driver_map_ that maps the query id to
 > the QueryDriver
 > * QueryDriver::RunFrontendPlanner creates the TExecRequest (owned
 > by the QueryDriver)
 > * Eventually, the query fails and is retried
 > * QueryDriver::RetryQueryFromThreadPool contains all the retry
 > logic (moved from ImpalaServer)
 > * It cancels the original query, creates the new one and runs it
 > * It calls ImpalaServer::RegisterQuery again, but with the retried
 > query id
 > * It manipulates some internal pointers so that retried_client_request_state_
 > now points to the CRS of the retried query

That all sounds good to me

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025
PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()());
> I'll try thinking through this some more. One challenge with creating an in
Sure, I think this can probably be deferred for now and we can go with your 
original approach.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 24 Feb 2020 21:25:44 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-13 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025
PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()());
> I sort of did this in the previous patch update. I added a client_query_id_
I'll try thinking through this some more. One challenge with creating an 
internal vs. user facing id is that a common debugging practice is to take user 
facing id and use it to grep through the logs. A lot of the log statements 
contain the query id they are associated with. So the user would have to 
manually figure out the mapping from the user facing id (perhaps it got the id 
from the runtime profile or the client logs) and the internal id.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 13 Feb 2020 19:06:57 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-13 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 7:

Instead of creating a separate doc, think it might be easier to just post 
comments here.

I've done some re-factoring locally and this is what I have so far:

* Created a "QueryDriver", which (as suggested by Thomas) is a layer of 
abstraction of on top of ClientRequestState
* "QueryDriver" owns the ClientRequestState; all access to the 
ClientRequestState goes through the QueryDriver
* Since the first patch will only support a single retry, the QueryDriver has a 
regular client_request_state_ and a retried_client_request_state_ (one for the 
original query and the other for the retried query)
* QueryDriver owns the TExecRequest that is used by both the original and 
retried queries
* The "transparent" part of this feature is now handled by the QueryDriver 
(since the QueryDriver owns all access to the ClientRequestState)
* The QueryDriver has two methods: GetActiveClientRequestState() and 
GetClientRequestState(query_id)
* GetActiveClientRequestState() returns the CRS for the "active" query - e.g. 
if the query has not been retried yet, it returns the original query; if the 
query has been retried, it returns the retried query
* GetClientRequestState(query_id) returns the CRS for the query that matches 
the given 'query_id' - this allows clients to get the CRS of exactly the 
query_id they are looking for
* ImpalaServer creates a QueryDriver for each query; the 
ImpalaServer::client_request_state_map_ has been replaced with the 
ImpalaServer::query_driver_map_ which maps query ids to QueryDrivers
* The query_driver_map_ can have key, values pairs with the same value (e.g. 
the same QueryDriver)
* Moved all of the retry code (along with the retry threadpool) into 
QueryDriver (specifically QueryDriver::RetryQueryFromThreadPool)

Example:

* Client submits a query for execution
* Impala creates a QueryDriver for the query
* QueryDriver::CreateClientRequestState creates a ClientRequestState for the 
query (this CRS is owned by the QueryDriver)
* ImpalaServer::RegisterQuery "registers" the query and adds an entry to 
ImpalaServer::query_driver_map_ that maps the query id to the QueryDriver
* QueryDriver::RunFrontendPlanner creates the TExecRequest (owned by the 
QueryDriver)
* Eventually, the query fails and is retried
* QueryDriver::RetryQueryFromThreadPool contains all the retry logic (moved 
from ImpalaServer)
* It cancels the original query, creates the new one and runs it
* It calls ImpalaServer::RegisterQuery again, but with the retried query id
* It manipulates some internal pointers so that retried_client_request_state_ 
now points to the CRS of the retried query


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 13 Feb 2020 19:05:45 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5186/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 10 Feb 2020 17:36:57 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 6:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5184/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 10 Feb 2020 17:34:03 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc@1463
PS6, Line 1463: void ImpalaServer::BlockOnWait(shared_ptr* 
request_state, bool* timed_out,
> line too long (94 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 10 Feb 2020 16:51:50 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#7).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes.

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the leaving node, queries are retried), and (2) if a
query fails and as a result, blacklists a node. Both events are
considered cluster membership changes as they affect what nodes a query
will be scheduled on.

Implementation:
* Query retries are driven by a dedicated threadpool
* ImpalaServer::RetryQueryFromThreadPool implements the core logic to
  actually retry a failed query.
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A query cannot be retried once any results from the original query
  have been fetched, this is to prevent users from seeing incorrect results

Features:
* Retries are transparent to the user
* This is achieved by adding a mapping from failed query ids to the
  query id of the retried query
* ImpalaServer uses this mapping in GetClientFacingRequestState
  which is used to differentiate between "client facing" requests
  for a ClientRequestState vs. internal requrets for a CRS
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* "Impala Query Status" is a new field in runtime profiles that
  displays the ClientRequestState execution state (which includes
  the RETRYING and RETRIED states)
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state
* Retried queries skip all fe/ planning, authorization, etc.
* This feature is configurable ('retry_failed_queries') and is off by
  default

Refactoring:
* Changes the ClientRequestState so that it can take in an existing
  TExecRequest
* This is required when retrying queries because the
  TExecRequest of the failed query is copied and used for the
  ClientRequestState of the retried query
* ClientRequestState::ExecState is extended with three new states:
  RETRYING, RETRIED, and UNKNOWN.

Testing:
* Added integration tests in test_query_retries.py, these tests
  consistently pass when run locally
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures

TODO:
* Additional re-factoring / code cleanup
* Lots more documentation

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
A be/src/service/retry-work.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_query_retries.py
16 files changed, 793 insertions(+), 168 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/7
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/6/be/src/service/impala-server.cc@1463
PS6, Line 1463: void ImpalaServer::BlockOnWait(shared_ptr* 
request_state, bool* timed_out,
line too long (94 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 10 Feb 2020 16:51:03 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Sahil Takiar (Code Review)

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 5:

(7 comments)

Addressed some of the comments and filed a few follow up JIRAs. Going to start 
a doc to discuss the "transparent" part of this feature (e.g. the internal 
query ids). Working on doing some of the re-factoring you mentioned (e.g. 
adding another layer of abstraction on top of ClientRequestState).

http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@26
PS5, Line 26: A query cannot be retried once any results from the original query
:   have been fetched, this is to prevent users from seeing 
incorrect results
> Do we have any mechanism for a user to specify that they want their results
Not yet, but planning to tackle that in a follow up JIRA: IMPALA-9225


http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@41
PS5, Line 41: The Impala Web UI will list all retried queries as being in the
:   "RETRIED" state
> This may be a reasonable approach, but I think we should think about the us
Yeah, these are legitimate issues. We have few follow up JIRAs to tackle some 
of these problems: IMPALA-9229, IMPALA-9230, IMPALA-9213 and I just filed 
IMPALA-9364 as well - open to more suggestions about how to improve 
supportability.

For this patch, I think it should be sufficient to add the original query id to 
the retried query profile, and vice versa.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h
File be/src/service/client-request-state-map.h:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h@38
PS5, Line 38: bool overwrite = false
> I don't think this is being used anywhere anymore.
Done


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h@82
PS5, Line 82: RETRYING, RETRIED, UNKNOWN
> I think this approach, and the way you have to save the previous state and
Yeah, thats a good suggestion. Haven't looked into adding the 
ClientRequestState wrapper yet, but for now just split out the the RETRYING and 
RETRIED states into a `enum RetryState`. It actually simplifies the state 
management considerably, so thanks for the suggestion.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@632
PS5, Line 632: shared_ptr request_state = 
GetClientFacingRequestState(query_id);
> Unless I'm missing something in this patch, this means that only the profil
hmm not sure why I did that, changed back to GetClientRequestState


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@920
PS5, Line 920:   unique_ptr exec_request = 
make_unique();
> So I'm wondering if this patch would benefit from adding another layer of a
I think that makes sense. IMO the ImpalaServer <--> ClientRequestState <--> 
Coordinator code needs some re-factoring in general.

I wanted to some general re-factoring all this code (ImpalaServer, 
ClientRequestState, Coordinator) so I filed IMPALA-9370 as a follow up, but I 
think it makes sense to move some of the retry logic into its own class in this 
patch.

I re-factored the BlockOnWait code so that it isn't duplicated across the hs2 
and beeswax server.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025
PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()());
> As mentioned before, I definitely think it would be awesome if we could cre
I sort of did this in the previous patch update. I added a 
client_query_id_mapping_ that allows an optional mapping between query ids. It 
doesn't exactly create an "internal" id, but its cleaner that what I was doing 
before.

I think I'm going to resurrect the design doc for this feature and write up 
some notes on how best to approach this. We can finalize the design there.

The mixing of the hi/lo of the TUniqueId is an interesting idea. I'll be sure 
to include that.



--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 10 Feb 2020 16:50:40 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-02-10 Thread Sahil Takiar (Code Review)

Hello Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#6).

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes.

Query retries are triggered if (1) a node has been removed from the
cluster membership by a statestore update (rather than cancelling all
queries running on the leaving node, queries are retried), and (2) if a
query fails and as a result, blacklists a node. Both events are
considered cluster membership changes as they affect what nodes a query
will be scheduled on.

Implementation:
* Query retries are driven by a dedicated threadpool
* ImpalaServer::RetryQueryFromThreadPool implements the core logic to
  actually retry a failed query.
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original query
  is closed
* A query cannot be retried once any results from the original query
  have been fetched, this is to prevent users from seeing incorrect results

Features:
* Retries are transparent to the user
* This is achieved by adding a mapping from failed query ids to the
  query id of the retried query
* ImpalaServer uses this mapping in GetClientFacingRequestState
  which is used to differentiate between "client facing" requests
  for a ClientRequestState vs. internal requrets for a CRS
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* "Impala Query Status" is a new field in runtime profiles that
  displays the ClientRequestState execution state (which includes
  the RETRYING and RETRIED states)
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state
* Retried queries skip all fe/ planning, authorization, etc.
* This feature is configurable ('retry_failed_queries') and is off by
  default

Refactoring:
* Changes the ClientRequestState so that it can take in an existing
  TExecRequest
* This is required when retrying queries because the
  TExecRequest of the failed query is copied and used for the
  ClientRequestState of the retried query
* ClientRequestState::ExecState is extended with three new states:
  RETRYING, RETRIED, and UNKNOWN.

Testing:
* Added integration tests in test_query_retries.py, these tests
  consistently pass when run locally
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures

TODO:
* Additional re-factoring / code cleanup
* Lots more documentation

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
A be/src/service/retry-work.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_query_retries.py
16 files changed, 793 insertions(+), 168 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/6
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-01-31 Thread Thomas Tauber-Marshall (Code Review)

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@26
PS5, Line 26: A query cannot be retried once any results from the original query
:   have been fetched, this is to prevent users from seeing 
incorrect results
Do we have any mechanism for a user to specify that they want their results to 
always be spooled until the query is complete to ensure that any failed queries 
that are retriable do in fact get retried?


http://gerrit.cloudera.org:8080/#/c/14824/5//COMMIT_MSG@41
PS5, Line 41: The Impala Web UI will list all retried queries as being in the
:   "RETRIED" state
This may be a reasonable approach, but I think we should think about the user 
experience implications of this.

In particular, it means that a retried query will show up as two separate 
queries on the webui, which is potentially confusing, esp. since the retry will 
have a different query_id that won't correspond the anything user-facing (eg. 
won't be the query_id printed out by impala-shell) and as you've got the patch 
right now, I'm not even sure if there's a way to figure out which retries would 
correspond to which original queries. We could of course do something like add 
the retry query id to the profile of the original query, but then people have 
to click through to the profile to match things up.

There's also the problem that once CloseClientRequestState is called at the end 
of RetryQueryFromThreadPool, the original query will show up in the "Completed 
Queries" section (and may even disappear from there before the retry query 
completes if the query_log fills up), which is again probably confusing since 
from the perspective of the client that query is really still running.

Some of this could possibly wait until a follow up patch to work out, as long 
as we've got a plan.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h
File be/src/service/client-request-state-map.h:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state-map.h@38
PS5, Line 38: bool overwrite = false
I don't think this is being used anywhere anymore.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/client-request-state.h@82
PS5, Line 82: RETRYING, RETRIED, UNKNOWN
I think this approach, and the way you have to save the previous state and then 
sometimes use this or sometimes the previous state, is confusing.

Its probably more straight forward to just have a separate 'enum Retry State { 
RETRYING, RETRIED, NOT_RETRIED }

>From my other comments, if we add a ClientRequestState wrapper, it could go in 
>there, or even might not be necessary since the wrapper would already know if 
>we're retrying based on if there's multiple ClientRequestStates that its 
>wrapped around.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@632
PS5, Line 632: shared_ptr request_state = 
GetClientFacingRequestState(query_id);
Unless I'm missing something in this patch, this means that only the profile 
for the retry query will be available through the GetRuntimeProfile() api. 
Possibly fine to leave for followup work, but I think we definitely need to 
have some way of returning the profile for the original query too.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@920
PS5, Line 920:   unique_ptr exec_request = 
make_unique();
So I'm wondering if this patch would benefit from adding another layer of 
abstraction.

What I have in mind is something like a wrapper around ClientRequestState that 
can contain multiple ClientRequestStates (though possibly we would name the 
wrapper ClientRequestState and rename the current ClientRequestState to 
something like QueryInstanceState or whatever).

Some benefits of doing this:
- It would provide a natural place for TExecRequest to live without doing this 
stuff with unique_ptr and std::move, as well as anything else that's common 
across all retries.
- It would allow us to hide some of the details of retries from impala-server, 
eg. we wouldn't need the logic around BlockOnWait that you've duplicated 
between the hs2 server and the beeswax server.


http://gerrit.cloudera.org:8080/#/c/14824/5/be/src/service/impala-server.cc@1025
PS5, Line 1025: query_ctx->query_id = UuidToQueryId(random_generator()());
As mentioned before, I definitely think it would be awesome if we

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-01-24 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5517/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 24 Jan 2020 23:18:52 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9199: Add support for single query retries on cluster membership changes

2020-01-24 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14824 )

Change subject: IMPALA-9199: Add support for single query retries on cluster 
membership changes
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5516/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 24 Jan 2020 23:16:59 +
Gerrit-HasComments: No

1 2 >

1 - 100 of 111 matches

Mail list logo