[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Reviewed-on: http://gerrit.cloudera.org:8080/15154
Reviewed-by: Thomas Tauber-Marshall 
Tested-by: Impala Public Jenkins 
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
11 files changed, 437 insertions(+), 266 deletions(-)

Approvals:
  Thomas Tauber-Marshall: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 8: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 25 Feb 2020 01:22:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5317/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 24 Feb 2020 21:36:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5394/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 24 Feb 2020 20:49:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 8: Code-Review+2

carrying forward


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 24 Feb 2020 20:49:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#8).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
11 files changed, 437 insertions(+), 266 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/8
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5316/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 24 Feb 2020 20:44:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-24 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#7).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
11 files changed, 437 insertions(+), 266 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/7
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-21 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 6: Code-Review+2

This makes sense to me. Since Sahil was ready to bump to +2, I'm going to +2.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 21 Feb 2020 18:33:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-20 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@90
PS5, Line 90:   exec_serialization_timer_ = 
PROFILE_ExecSerializationTimer.Instantiate(host_profile_);
> Sure, I mean after thinking about it some more, I would probably prefer to
Hadn't noticed "Backend startup latencies" before. So true, you could derive an 
estimate of the serialization time with the current metrics, however, I think 
still think it requires expert knowledge of how these runtime profile metrics 
are created. Most average Impala users wouldn't be able to figure this out on 
their own.

Regardless, adding this is really only a nice to have, and not directly related 
to this patch. So happy with just pushing this out to a follow up JIRA.



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 21 Feb 2020 00:32:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-20 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@90
PS5, Line 90:   exec_serialization_timer_ = 
PROFILE_ExecSerializationTimer.Instantiate(host_profile_);
> this will add the timer to the 'Per Node Profiles' right? would it be suffi
Sure, I mean after thinking about it some more, I would probably prefer to just 
remove this timer entirely - we already have the time for how long it took to 
do the entire startup (serialization + waiting on rpcs to return) from the 
query timeline and we also know how long the rpcs took from the "Backend 
startup latencies" histogram, so the serialization time is basically just the 
difference between the two.

So, I don't think we're really getting any extra information by adding this 
timer.



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 20 Feb 2020 22:29:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-19 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 6: Code-Review+1

(2 comments)

minor comments, generally LGTM. can upgrade to +2 if someone else is willing to 
take a quick look over this.

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@90
PS5, Line 90:   exec_serialization_timer_ = 
PROFILE_ExecSerializationTimer.Instantiate(host_profile_);
this will add the timer to the 'Per Node Profiles' right? would it be 
sufficient just to have one high-level timer that aggregates all these values? 
- e.g. a timer that aggregates the serialization time spent for all backends?


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc@447
PS5, Line 447:   ThriftSerializer serializer(true);
> Sure, I can add one if you want, but I think this is less interesting to ti
i think including this would make more sense given my other comment. i think 
for completeness it would be good to include so that the timer tracks all 
serialization done for a query



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 19 Feb 2020 17:13:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-18 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@195
PS5, Line 195: rpc_latency_
> I'm not sure you're correct about this. One observation is that I haven't a
So after talking with Tim about this offline and doing some additional reading 
(eg. https://en.cppreference.com/w/cpp/language/memory_model) I'm confident 
that its thread-safe as is.

The reason is that the read and write are synchronized by blocking/notifying on 
the CountingBarrier, which results in a memory barrier.



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 18 Feb 2020 22:25:50 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5259/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 18 Feb 2020 21:44:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-18 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#6).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

It also adds a new profile timer, ExecSerializationTimer, which
records the amount of time spent perparing the Exec() params for each
backend.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
11 files changed, 444 insertions(+), 266 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/6
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-18 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@195
PS5, Line 195: rpc_latency_
> I'm not sure if this variable is thread safe anymore - I think this Cb is b
I'm not sure you're correct about this. One observation is that I haven't 
actually changed the thread safety of these variables - previously they would 
have been written in exec_rpc_thread_pool_ and read by a different thread.

I'll also note that the article you previously linked to 
(http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/) seems 
to say this isn't an issue for x86, which afaik Impala is only ever run on x86, 
though I'm not necessarily thrilled about relying on that.

Its probably not a big deal to just go ahead and hold 'lock_' when these are 
read, since they're only read in FinishBackendStartup() and it won't really add 
any lock contention.

However, I am curious about this, so I'm gonna try to ask around to see if 
there's someone more knowledgeable who can weigh in.


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@227
PS5, Line 227: DCHECK(!exec_done_);
> move this to right before the ThriftSerializer is called? otherwise if the 
Sure, I moved it down to right before the call to SetRpcParams(), which I think 
makes sense to include in this timer as part of the work to set up the rpc 
params.

If that still seems confusing, it might be necessary to rename the timer.


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc@447
PS5, Line 447:   ThriftSerializer serializer(true);
> should SCOPED_TIMER(exec_serialization_timer_); go here as well?
Sure, I can add one if you want, but I think this is less interesting to time 
than the other serialization stuff as it only happens once per query and the 
amount of data being serialized is generally much smaller.


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc@982
PS5, Line 982:   DCHECK(exec_rpcs_complete_.Load()) << "Exec() must be called 
first.";
> Will this break the ImpalaServer UnresponsiveBackendThread thread? It calls
So its confusing and I don't know why they did it this way, but all of these 
calls are on a Coordinator object that is returned by 
ClientRequestState::GetCoordinator(), which returns nullptr if the coordinator 
hasn't successfully completed the call to Exec(), so no all of the DCHECKs are 
valid.



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 18 Feb 2020 21:01:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-13 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@195
PS5, Line 195: rpc_latency_
I'm not sure if this variable is thread safe anymore - I think this Cb is being 
invoked by a kRPC thread right? but it gets read later on in 
Coordinator::FinishBackendStartup without holding onto the BackendState lock_

I think the same issue applies to exec_rpc_status_

It looks like reads of last_report_time_ms_ get around this by acquiring the 
lock_ inside the BackendState::last_report_time_ms() method.


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator-backend-state.cc@227
PS5, Line 227: SCOPED_TIMER(exec_serialization_timer_);
move this to right before the ThriftSerializer is called? otherwise if the 
below code throws an error or returns, this could be non-zero, even if no 
serialization is done, which seems odd


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc@447
PS5, Line 447:   ThriftSerializer serializer(true);
should SCOPED_TIMER(exec_serialization_timer_); go here as well?


http://gerrit.cloudera.org:8080/#/c/15154/5/be/src/runtime/coordinator.cc@982
PS5, Line 982:   DCHECK(exec_rpcs_complete_.Load()) << "Exec() must be called 
first.";
Will this break the ImpalaServer UnresponsiveBackendThread thread? It calls 
this method for each registered query, and if I'm reading the code in 
ImpalaServer::ExecuteInternal correctly, it register the query before calling 
Coordinator::Exec

I think some of the other DCHECKs might suffer from the same issue - 
specifically any that are called due to a HTTP request from the Web UI. Since 
those are triggered asynchronously there could be a race condition between when 
the query is registered, when the HTTP request is fired, and when 
Coordinator::Exec is actually called. For example, BackendsToJson is called by 
ImpalaHttpHandler::QueryBackendsHandler.



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 13 Feb 2020 20:22:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5211/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 12 Feb 2020 20:21:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5210/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 12 Feb 2020 19:44:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-12 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#5).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
11 files changed, 444 insertions(+), 266 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/5
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-12 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15154/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/3//COMMIT_MSG@46
PS3, Line 46: Testing:
: - Added a e2e test that verifies that a query where an Exec()
> this can be rebased now, right?
Done



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 12 Feb 2020 19:35:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-12 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@23
PS2, Line 23: Removing this thread pool has potential performance implications, 
as
: it means that the Exec() parameters are serialized in serialize 
rather
: than in parallel (with the level of parallelism determined by the 
size
: of the thread pool, which was configurable by an Advanced flag and
: defaulted to 12).
> right, a more fine grained counter around all the calls to the ThriftSerial
Done


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.h@406
PS3, Line 406:
> nit: single quote
Done


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@310
PS3, Line 310:
> nit: better name might be 'error' since this is only invoked on error
Its also invoked in the case of IsEmptyBackend() being true, which isn't an 
error.


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@549
PS3, Line 549:  (!exec_rpc_sent_) {
> can this be re-factored as well so that notify is called after releasing th
Obviously its sort of tricky. I could do the same thing as in the other 
functions (put everything in a block with a goto label after it) but that's 
ugly and I hate to do it for only even one branch.

I suppose there's no reason that I can't just call 'unlock()' on the 
unique_lock, other than the fact that we don't usually do that. Should be fine 
as long as I leave a comment explaining why we're doing it.


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@900
PS3, Line 900: ess
> nit: the 'For' seems extraneous and doesn't always make sense in the contex
Done


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@901
PS3, Line 901: Coordin
> nit: maybe 'target backend' to make it clear that the Coordinator is attemp
Done


http://gerrit.cloudera.org:8080/#/c/15154/3/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

http://gerrit.cloudera.org:8080/#/c/15154/3/tests/custom_cluster/test_rpc_exception.py@28
PS3, Line 28: def _get_rpc_debug_action(rpc, action, port=KRPC_PORT):
:   """Returns a debug action that causes rpcs with the name 'rpc' 
that are sent to the
:   impalad at 'port' to execute the debug aciton 'action'."""
:   return "IMPALA_SERVICE_POOL:127.0.0.1:{port}:{rpc}:{action}" \
:   .format(rpc=rpc, port=port, action=action)
:
:
: def _get_fail_action(rpc, error=None, port=KRPC_PORT, p=0.1):
:   """Returns a debug action that causes rpcs with the name 'rpc' 
that are sent to the
:   impalad at 'port' to FAIL with probability 'p' and return 
'error' if specified."""
:   action = "FAIL@%s" % p
:   if error is not None:
: action += "@" + error
:   return _get_rpc_debug_action(rpc, action, port=port
> do we expect to use these methods outside of the TestRPCException class?
No, I moved them outside of the class because I couldn't figure out how to have 
one call the other given that they're called from the function decorators where 
there isn't a 'this' available and python's support for 'static' type stuff is 
weird. Probably someone who understand python better than me could make it work.

I added '_' to the start of both functions names to make it clearer they're not 
intended for use outside this file.



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 12 Feb 2020 19:00:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-12 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#4).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


TODO: once IMPALA-9335 (krpc rebase) goes in, the change in
be/src/kudu/rpc/connection.cc can be removed from this patch.

Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/kudu/rpc/connection.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
12 files changed, 446 insertions(+), 261 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/4
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-11 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@23
PS2, Line 23: Removing this thread pool has potential performance implications, 
as
: it means that the Exec() parameters are serialized in serialize 
rather
: than in parallel (with the level of parallelism determined by the 
size
: of the thread pool, which was configurable by an Advanced flag and
: defaulted to 12).
> Yes, we record the amount of time query startup takes, the time between whe
right, a more fine grained counter around all the calls to the ThriftSerializer 
might be nice though, especially since this has bitten us in the past: 
IMPALA-8732

there is also a lot of stuff that happens between those two timeline events - 
e.g. issuing all the Exec RPCs and waiting for them to complete


http://gerrit.cloudera.org:8080/#/c/15154/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/3//COMMIT_MSG@46
PS3, Line 46: TODO: once IMPALA-9335 (krpc rebase) goes in, the change in
: be/src/kudu/rpc/connection.cc can be removed from this patch.
this can be rebased now, right?


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.h@406
PS3, Line 406: "
nit: single quote


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@310
PS3, Line 310: done
nit: better name might be 'error' since this is only invoked on error


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@549
PS3, Line 549: exec_done_cv_.NotifyAll();
can this be re-factored as well so that notify is called after releasing the 
lock?


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@900
PS3, Line 900: For
nit: the 'For' seems extraneous and doesn't always make sense in the context of 
the usages of this method


http://gerrit.cloudera.org:8080/#/c/15154/3/be/src/runtime/coordinator-backend-state.cc@901
PS3, Line 901: backend
nit: maybe 'target backend' to make it clear that the Coordinator is attempting 
to make a call to a remote backend?


http://gerrit.cloudera.org:8080/#/c/15154/3/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

http://gerrit.cloudera.org:8080/#/c/15154/3/tests/custom_cluster/test_rpc_exception.py@28
PS3, Line 28: def get_rpc_debug_action(rpc, action, port=KRPC_PORT):
:   """Returns a debug action that causes rpcs with the name 'rpc' 
that are sent to the
:   impalad at 'port' to execute the debug aciton 'action'."""
:   return "IMPALA_SERVICE_POOL:127.0.0.1:{port}:{rpc}:{action}" \
:   .format(rpc=rpc, port=port, action=action)
:
:
: def get_fail_action(rpc, error=None, port=KRPC_PORT, p=0.1):
:   """Returns a debug action that causes rpcs with the name 'rpc' 
that are sent to the
:   impalad at 'port' to FAIL with probability 'p' and return 
'error' if specified."""
:   action = "FAIL@%s" % p
:   if error is not None:
: action += "@" + error
:   return get_rpc_debug_action(rpc, action, port=port)
do we expect to use these methods outside of the TestRPCException class?



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 11 Feb 2020 17:31:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5168/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 07 Feb 2020 23:48:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-07 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 3:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@10
PS2, Line 10: Previously, Impala would issue all the Exec()s, wait for all of 
them
: to complete, and then check if any of them resulted in an error. 
We
: now stop issuing Exec()s and cancel any that are still in flight 
as
: soon as an error occurs.
> so this covers IMPALA-6788 as well, right?
Not completely - there's still the case where we get an error report from a 
running fragment while still doing startup. With this patch, we'll still wait 
for all Exec()s to complete before processing the error.

I added a TODO with this JIRA to call this out.


http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@23
PS2, Line 23: Removing this thread pool has potential performance implications, 
as
: it means that the Exec() parameters are serialized in serialize 
rather
: than in parallel (with the level of parallelism determined by the 
size
: of the thread pool, which was configurable by an Advanced flag and
: defaulted to 12).
> do we have a runtime counter to measure this overhead?
Yes, we record the amount of time query startup takes, the time between when we 
mark the event "Ready to start on X backends" to when we mark "All X execution 
backends started"


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@82
PS2, Line 82: Return true
> nit: "Returns true"
Done


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@96
PS2, Line 96: after WaitOnExecRpc(
> nit: needs to be updated
Done


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@102
PS2, Line 102: Exec
> nit: should the docs of this method be updated to indicate that this method
Done


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@108
PS2, Line 108: imes and f
> might want to mention some more docs about thread safety - e.g. is a client
Done


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@369
PS2, Line 369:
 :
> same as below
Done


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@375
PS2, Line 375:
 :   int64_t last_report_time_ms_ = 0;
> i believe this isn't thread safe. even if it is no longer set after Exec()
Went ahead and protected this with 'lock_' always.

It does mean I removed some DCHECKs on it in places where we're not taking 
'lock_' otherwise, but I don't think that's a big deal.


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@a452
PS2, Line 452:
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
> looks like dead code, but are any of the TODOs still relevant?
Sort of. I basically replaced this with the large comment I added earlier in 
this function, which addresses the same basic issue but better reflects the 
actual current state of things.


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@173
PS2, Line 173:
> i believe we generally try to notify any waiting threads after the lock has
Yes, there can potentially be multiple calls to WaitOnExec(), eg. if there are 
multiple calls to Coordinator::WaitOnExecRpcs(), as discussed in another 
comment.


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@193
PS2, Line 193:
> nit: won't this be called twice, once in SetExecError and again when method
Done


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@228
PS2, Line 228: if (IsEmptyBackend
> is this thread safe? this gets set without holding onto the lock, and its p
So I went ahead and fixed this by holding 'lock_' for all of the Exec() 
function, rather than just taking it right before sending the rpc.

Its nice, because since we also hold lock_ for all of Cancel(), it 
significantly simplifies reasoning about possible interleavings between those 
functions.

The downside is that we can't stop ourselves from sending an exec rpc on to 
backend if we've already started executing BackendState::Exec(), eg. if we 
realize we hit an error from a different backend while 

[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-07 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#3).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


TODO: once IMPALA-9335 (krpc rebase) goes in, the change in
be/src/kudu/rpc/connection.cc can be removed from this patch.

Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/kudu/rpc/connection.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
12 files changed, 442 insertions(+), 267 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/3
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-05 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@10
PS2, Line 10: Previously, Impala would issue all the Exec()s, wait for all of 
them
: to complete, and then check if any of them resulted in an error. 
We
: now stop issuing Exec()s and cancel any that are still in flight 
as
: soon as an error occurs.
so this covers IMPALA-6788 as well, right?



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 05 Feb 2020 18:20:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-04 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 2:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@23
PS2, Line 23: Removing this thread pool has potential performance implications, 
as
: it means that the Exec() parameters are serialized in serialize 
rather
: than in parallel (with the level of parallelism determined by the 
size
: of the thread pool, which was configurable by an Advanced flag and
: defaulted to 12).
do we have a runtime counter to measure this overhead?


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@82
PS2, Line 82: Returns True
nit: "Returns true"


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@96
PS2, Line 96: rpc_complete_barrier
nit: needs to be updated


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@102
PS2, Line 102: Exec
nit: should the docs of this method be updated to indicate that this method is 
now asynchronous?


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@108
PS2, Line 108: WaitOnExec
might want to mention some more docs about thread safety - e.g. is a client 
allowed to call Exec from one thread and then call WaitOnExec from another 
thread?


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@369
PS2, Line 369: Only needs to be protected by
 :   /// 'lock_' until Exec() has finished, after which it is 
constant.
same as below


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@375
PS2, Line 375: Only needs
 :   /// to be protected by 'lock_' until Exec() has finished, 
after which it is constant.
i believe this isn't thread safe. even if it is no longer set after Exec() has 
finished, the reader still needs to acquire the lock - the reader thread could 
have cached the exec_done_ in its CPU cache, which means it won't see an 
updates from the writer

http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/ has some 
details, specifically the part about thread caches

these could be made atomic if we don't want to acquire the lock each time they 
are read.


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@a452
PS2, Line 452:
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
looks like dead code, but are any of the TODOs still relevant?


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@173
PS2, Line 173: exec_done_cv_.NotifyAll();
i believe we generally try to notify any waiting threads after the lock has 
been released. that way the waiting threads can acquire the lock immediately, 
rather than waking up, and then going back to sleep because the lock is still 
acquired - IMPALA-6125

also will there be multiple threads waiting on this CV?


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@193
PS2, Line 193: exec_done_cv_.NotifyAll();
nit: won't this be called twice, once in SetExecError and again when method 
exits?


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@228
PS2, Line 228: exec_done_ = true;
is this thread safe? this gets set without holding onto the lock, and its 
possible for Coordinator::UpdateBackendExecStatus or UpdateFilter to call 
WaitOnExec from a different thread


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@236
PS2, Line 236: SetExecError
same issue as above, this sets exec_done_ without holding the lock


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@853
PS2, Line 853: DCHECK(exec_done_)
should be run after acquiring the lock, or made atomic


http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator.cc@427
PS2, Line 427:   if (exec_rpcs_complete_.Load()) return;
is the idea here that this method should only ever be executed once? whats the 
thread safety guarantee on this method suppose to be?



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit 

[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 2:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5123/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 04 Feb 2020 21:07:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-04 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py@27
PS1, Line 27:
> flake8: E302 expected 2 blank lines, found 1
Done


http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py@33
PS1, Line 33:
> flake8: E302 expected 2 blank lines, found 1
Done


http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py@147
PS1, Line 147: a
> flake8: F841 local variable 'result' is assigned to but never used
Done



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 04 Feb 2020 18:15:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-04 Thread Thomas Tauber-Marshall (Code Review)
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#2).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


TODO: once IMPALA-9335 (krpc rebase) goes in, the change in
be/src/kudu/rpc/connection.cc can be removed from this patch.

Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/kudu/rpc/connection.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
12 files changed, 387 insertions(+), 203 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/2
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 1:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5597/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 03 Feb 2020 23:06:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15154 )

Change subject: IMPALA-8712: Make ExecQueryFInstances async
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py
File tests/custom_cluster/test_rpc_exception.py:

http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py@27
PS1, Line 27: def get_rpc_debug_action(rpc, action, port=KRPC_PORT):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py@33
PS1, Line 33: def get_fail_action(rpc, error=None, port=KRPC_PORT, p=0.1):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/15154/1/tests/custom_cluster/test_rpc_exception.py@147
PS1, Line 147: r
flake8: F841 local variable 'result' is assigned to but never used



--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 03 Feb 2020 22:21:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8712: Make ExecQueryFInstances async

2020-02-03 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/15154


Change subject: IMPALA-8712: Make ExecQueryFInstances async
..

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.


TODO: once IMPALA-9335 (krpc rebase) goes in, the change in
be/src/kudu/rpc/connection.cc can be removed from this patch.

Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/kudu/rpc/connection.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
11 files changed, 382 insertions(+), 201 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/1
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall