[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-03-27 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has abandoned this change. ( 
http://gerrit.cloudera.org:8080/9718 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Abandoned

Thanks Lars for taking a look at this. After discussing with Dan, we've decided 
to leave this alone for now, in anticipation of more general improvements to 
query lifecycle logic.

I have a patch out to disable the remaining commonly seen flaky test from this: 
https://gerrit.cloudera.org/#/c/9822/
--
To view, visit http://gerrit.cloudera.org:8080/9718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
Gerrit-Change-Number: 9718
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-03-26 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9718 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/9718/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9718/1//COMMIT_MSG@7
PS1, Line 7: Fix flaky test_profile_fragment_instances
This subject makes it sound like you're fixing the flaky test. But you're 
actually making an Impala change, so could you reword this to summarize the 
change?

Otherwise, when looking at git commits, it's hard to know that this change is 
making a subtle coordinator behavior change.


http://gerrit.cloudera.org:8080/#/c/9718/1//COMMIT_MSG@43
PS1, Line 43: Returning results for queries that are cancelled by the user is
: unaffected as the manual cancel path causes WaitForBackendCompletion().
I don't follow that. What does this have to do with query results?  Also, what 
do you mean by the "path causes WaitForBackendCompletion()". Which callpath are 
you referring to?


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h@119
PS1, Line 119:   boost::mutex* lock() { return &lock_; }
> I don't think it will help with the iteration part, but it'll prevent mista
I think we should try hard to not expose the lock out of a class. That's a big 
problem with ClientRequestState.  The fact that it's exposed usually means some 
abstraction is missing or wrong.

(If we really do end up needing to take the series of locks, then it'd be good 
to follow the pattern here: 
https://github.com/apache/impala/blob/dc2f69e5a0b36ca721eeadeec661a251c957410b/be/src/runtime/bufferpool/buffer-allocator.cc#L341)


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator.cc@1028
PS1, Line 1028:   // ensure the profile isn't being updated when we call Add(Split|Exec)Stats below.
What are the backend locks protecting in Add*Stats()?  Which profile is the 
comment referring to?



--
To view, visit http://gerrit.cloudera.org:8080/9718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
Gerrit-Change-Number: 9718
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Mon, 26 Mar 2018 23:36:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-03-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9718 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 1:

(5 comments)

Thanks for the replies. I left some questions around potential simplifications, 
but the current logic looks correct to me.

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h@119
PS1, Line 119:   boost::mutex* lock() { return &lock_; }
> Is there a reasonable way to make that work with the way that we have a lis
I don't think it will help with the iteration part, but it'll prevent mistakes 
where we forget to release the lock. During the locking phase we can just 
populate a vector of these scoped lock handles.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@237
PS1, Line 237:   return num_remaining_instances_ == 0 || (!status_.ok() && !status_.IsCancelled());
> Sure, that would be equivalent. I don't have a strong feeling either way ab
That makes sense, thanks for the explanation.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@249
PS1, Line 249:   for (const TFragmentInstanceExecStatus& instance_exec_status:
> I didn't want to add any more locking to the profile update path, which may
My main concern was making the locking protocol less complicated, but I wasn't 
aware that the lock I'm looking for is shared across backends.

Looking at the calls to ComputeQuerySummary(), two of them already follow a 
call to WaitForBackendCompletion(). Can we call WaitForBackendCompletion() in 
Coordinator::Cancel(), too, before calling ComputeQuerySummary()?


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@251
PS1, Line 251: instance_status
> Not quite, since (at least for successfully completed queries where
> fragments are cancelled because returned_all_results_ becomes true)
> BackendState::status_ is only updated as a result of these status
> reports, so the first time an instance reports CANCELLED, status_
> won't already be CANCELLED.

Can the coordinator cancel the backend_states in this case before sending out 
the cancellations?

>
> What you would need is access to Coordinator::query_status_, which
> we could pass into ApplyExecStatusReport(), except that accessing
> it requires taking Coordinator::lock_, which of course we don't
> want to hold for the duration of ApplyExecStatusReport().
>
> So, I think that what we could do is: in UpdateBackendExecStatus,
> take Coordinator::lock_ and make a copy of Coordinator::query_status_.
> Release lock_ and pass the copy into ApplyExecStatusReport(), along
> with a copy of the value of returned_all_results_. We can then say
> something like:
> DCHECK(!instance_status.IsCancelled() || (!query_status_.ok() ||
> returned_all_results_))


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@314
PS1, Line 314:   return IsDone();
> Right, that's the intended behavior.
Thx for the explanation.



--
To view, visit http://gerrit.cloudera.org:8080/9718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
Gerrit-Change-Number: 9718
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Wed, 21 Mar 2018 18:07:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-03-20 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9718 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 1:

(5 comments)

Obviously this is all a complete mess.

I've thought about some ways to clean it up (see eg: 
https://gerrit.cloudera.org/#/c/9656/) but haven't come up with something very 
compelling. If anyone has any ideas, I'd love to hear them.

My understanding is that there is some work planned soon around query lifecycle 
and profile reporting as part of the scale project, so hopefully we'll be able 
to make more progress towards making this sane.

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h@119
PS1, Line 119:   boost::mutex* lock() { return &lock_; }
> If we keep this, I think this could help prevent coding errors by handing o
Is there a reasonable way to make that work with the way that we have a list of 
Backends that need to be iterated over and locked?


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@237
PS1, Line 237:   return num_remaining_instances_ == 0 || (!status_.ok() && !status_.IsCancelled());
> Is it possible to move the !status_.IsCancelled() to where we expect it to
Sure, that would be equivalent. I don't have a strong feeling either way about 
which is preferable.

I chose to go this route because the behavior of ApplyExecStatusReport() 
returning a bool indicating whether IsDone() became true is already unfortunate 
(to be clear, I wrote that behavior to fix a different coordinator/runtime 
profile bug, so it's my fault), but at least it was tied to something concrete.

I suppose we could just say "ApplyExecStatusReport() returns true if 
'num_remaining_backends_' should be decreased", but then we end up with the 
situation where a backend is still considered to be executing by the 
coordinator even though IsDone() returns true, which also screws up the logic 
of LogFirstInProgress().


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@249
PS1, Line 249:   for (const TFragmentInstanceExecStatus& instance_exec_status:
> Instead of acquiring all locks in ComputeQuerySummary(), we could acquire a
I didn't want to add any more locking to the profile update path, which may be 
called a very large number of times for queries that are long running or have 
many fragments.

The relevant root profile would be for the FragmentStats for the fragment this 
instance corresponds to, and there could be contention with other backends 
trying to update different instances of the same fragment.

Adding the extra locking in ComputeQuerySummary() seemed better because it 
should only be called once at the end of the query.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@251
PS1, Line 251: instance_status
> Can we check here that (!instance_status.IsCancelled() || status_.IsCancell
Not quite, since (at least for successfully completed queries where fragments 
are cancelled because returned_all_results_ becomes true) BackendState::status_ 
is only updated as a result of these status reports, so the first time an 
instance reports CANCELLED, status_ won't already be CANCELLED.

What you would need is access to Coordinator::query_status_, which we could 
pass into ApplyExecStatusReport(), except that accessing it requires taking 
Coordinator::lock_, which of course we don't want to hold for the duration of 
ApplyExecStatusReport().

So, I think that what we could do is: in UpdateBackendExecStatus, take 
Coordinator::lock_ and make a copy of Coordinator::query_status_. Release lock_ 
and pass the copy into ApplyExecStatusReport(), along with a copy of the value 
of returned_all_results_. We can then say something like:
DCHECK(!instance_status.IsCancelled() || (!query_status_.ok() || 
returned_all_results_))


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@314
PS1, Line 314:   return IsDone();
> The new definition of IsDone() will return false for cancelled backends her
Right, that's the intended behavior.

num_remaining_backends_ is used to decide when to return from 
WaitForBackendCompletion(). There are two cases:

- The cancellation is happening because returned_all_results_ became true. By 
preventing num_remaining_backends_ from decreasing, we ensure that we wait for 
all fragments to Finalize() and provide us with final profile info. 
num_remaining_backends_ will still be decreased, but only once 
num_remaining_instances_ reaches 0.

- The cancellation is happening because the user 

[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-03-19 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9718 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 1:

(10 comments)

Had some comments on the overall approach.

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/common/status.h
File be/src/common/status.h:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/common/status.h@98
PS1, Line 98: CANCELLED should not be reported by a fragment
Can we enforce this by checking that instances only report cancellation after 
the query has been cancelled by the coordinator?


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.h@119
PS1, Line 119:   boost::mutex* lock() { return &lock_; }
If we keep this, I think this could help prevent coding errors by handing out 
scoped lock guards instead.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@237
PS1, Line 237:   return num_remaining_instances_ == 0 || (!status_.ok() && !status_.IsCancelled());
Is it possible to move the !status_.IsCancelled() to where we expect it to have 
an effect? That way we wouldn't have to change the definition of IsDone() to 
exclude IsCancelled().


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@248
PS1, Line 248:   if (IsDone()) return false;
Here we would add && !status.IsCancelled().


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@249
PS1, Line 249:   for (const TFragmentInstanceExecStatus& instance_exec_status:
Instead of acquiring all locks in ComputeQuerySummary(), we could acquire a 
lock on the root profile children here.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@251
PS1, Line 251: instance_status
Can we check here that (!instance_status.IsCancelled() || status_.IsCancelled())?


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@284
PS1, Line 284:   // We can't update this backend's profile if 
ReportQuerySummary() is running,
This comment seems outdated.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@314
PS1, Line 314:   return IsDone();
The new definition of IsDone() will return false for cancelled backends here, 
whereas it returned true before. This will prevent the coordinator from 
decreasing num_remaining_backends_ in UpdateBackendExecStatus().


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator-backend-state.cc@397
PS1, Line 397:  || status_.IsCancelled()
This could then be removed.


http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/9718/1/be/src/runtime/coordinator.cc@1026
PS1, Line 1026:   if (backend_states_.empty()) return;
We could acquire a lock on the root profile of the fragment stats here instead 
(see my comments in the backend state).



--
To view, visit http://gerrit.cloudera.org:8080/9718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
Gerrit-Change-Number: 9718
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 20 Mar 2018 00:33:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-03-19 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9718


Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating that the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
---
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
6 files changed, 37 insertions(+), 20 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/9718/1
--
To view, visit http://gerrit.cloudera.org:8080/9718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
Gerrit-Change-Number: 9718
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-02-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 6: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 02 Feb 2018 23:22:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-02-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating that the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Reviewed-on: http://gerrit.cloudera.org:8080/8997
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins
---
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
6 files changed, 30 insertions(+), 18 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-02-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 6: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 01 Feb 2018 23:39:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-02-01 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Sailesh Mukil, Joe McDonnell, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8997

to look at the new patch set (#6).

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating that the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
---
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
6 files changed, 30 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/8997/6
--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-02-01 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8997/5/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/8997/5/be/src/runtime/coordinator-backend-state.h@223
PS5, Line 223:   /// initiated cancel because all the results were already 
returned or an error was hit.
> Are we assuming that status_ == CANCELLED implies that the cancellation was
Yes, the assumption here is that status_ == CANCELLED here must have come from 
the coordinator - if a fragment reports that it's cancelled because it failed, 
these changes would mean that the coordinator would not initiate cancellation 
of the rest of the query as a result.

Added/cleaned up some comments around this.



--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 01 Feb 2018 23:06:43 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-01-31 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Sailesh Mukil, Joe McDonnell,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8997

to look at the new patch set (#4).

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating that the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
5 files changed, 27 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/8997/4
--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-01-25 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 3: Code-Review+1

(1 comment)

The change LGTM.

The only other thing I want to bring up is that we're preferring a full profile 
over a quick teardown of a query. From a correctness point of view, that's 
sound. However, that may regress short queries in terms of total query time 
taken. But that's probably a topic for a much larger discussion that can be 
addressed as part of IMPALA-2990.

Since this change alters the lifecycle of a query a bit, it would be better if 
someone else had a look too.

http://gerrit.cloudera.org:8080/#/c/8997/3/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/8997/3/be/src/runtime/coordinator.cc@462
PS3, Line 462:  
nit: Not your change, but we can get rid of this bad indent.



--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Thu, 25 Jan 2018 23:41:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-01-23 Thread Thomas Tauber-Marshall (Code Review)
Hello Michael Ho, Sailesh Mukil, Joe McDonnell,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8997

to look at the new patch set (#3).

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating that the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
5 files changed, 25 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/8997/3
--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-01-23 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator-backend-state.h@206
PS2, Line 206: either because the client called Cancel() or because the 
coordinator
 :   /// initiated cancel because all the results were already 
returned
> It could also be because another backend hit an error.
Done


http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator-backend-state.h@243
PS2, Line 243: other than cancel
> We're making a base assumption that a CANCELLED status is only obtained as
I don't think that there are any situations in which a fragment instance would 
try to cancel the query without it being the result of some error, and of 
course if the coordinator sees an error it will cancel the query.

I did some grepping around to confirm this, but I'll get someone with more 
authority to chime in.


http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator.cc@926
PS2, Line 926: // If the query was cancelled by the user, don't process the 
update.
 :   if (query_status_.IsCancelled()) return Status::OK();
> Shouldn't we tell the backend to cancel itself if the query was cancelled b
Yes, we're relying on Cancel() having already sent cancels to all of the 
backends.



--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Tue, 23 Jan 2018 21:44:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-01-19 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8997 )

Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator-backend-state.h@206
PS2, Line 206: either because the client called Cancel() or because the 
coordinator
 :   /// initiated cancel because all the results were already 
returned
It could also be because another backend hit an error.

Do we know of any other ways cancellation can be initiated? See more in my 
comment below.


http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator-backend-state.h@243
PS2, Line 243: other than cancel
We're making a base assumption that a CANCELLED status is only obtained as a 
result of the Coordinator setting and propagating it (for any reason).

But are we sure that's true (It would be ideal if that were true)? i.e. are we 
sure that there's no fragment instance cancelling itself (and hence the query) 
for some other reason?

Because if that was the case, then with this patch, a fragment instance trying 
to cancel the query wouldn't cause the query to be cancelled. Since the 
coordinator is assuming that if it sees a CANCELLED status from a fragment 
instance, that's due to the coordinator setting it.


http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/8997/2/be/src/runtime/coordinator.cc@926
PS2, Line 926: // If the query was cancelled by the user, don't process the 
update.
 :   if (query_status_.IsCancelled()) return Status::OK();
Shouldn't we tell the backend to cancel itself if the query was cancelled by 
the user? Or are we relying on the Cancel() RPCs doing that?



--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Comment-Date: Sat, 20 Jan 2018 00:45:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6338: Fix flaky test profile fragment instances

2018-01-10 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8997


Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
..

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating that the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
5 files changed, 25 insertions(+), 18 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/8997/1
--
To view, visit http://gerrit.cloudera.org:8080/8997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3a1e3c62952003f37f88fe2b662bb11889ed
Gerrit-Change-Number: 8997
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall