[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8631 )

Change subject: IMPALA-6227: deflake admission stress tests
..


Patch Set 5: Code-Review+2

Carry


--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 22 Nov 2017 23:17:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8631 )

Change subject: IMPALA-6227: deflake admission stress tests
..


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1518/


--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 22 Nov 2017 23:17:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8631 )

Change subject: IMPALA-6227: deflake admission stress tests
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py@698
PS4, Line 698: every
 :   # second
> shouldn't that say "at least every second" or something?
Done



--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 22 Nov 2017 23:16:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Hello Bikramjeet Vig, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8631

to look at the new patch set (#5).

Change subject: IMPALA-6227: deflake admission stress tests
..

IMPALA-6227: deflake admission stress tests

The problem was that, during the initial admission decision phase, some
queries were initially queued then dequeued once memory came available.
All of the accounting in the test implicitly relies on queries not being
dequeued until queries are later explicitly ended, so if this happened,
the test broke in multiple subtle ways.

This happened because the query only scanned a small number of
rows, which could be all buffered on the receiver side of the
exchange even before the client fetched any rows from the coordinator.
This means that the reserved memory on some backends could increase
then decrease during the initial admission phase, resulting in a
query being queued then dequeued.

The fix is to increase the number of rows returned by the query so that
all fragments remain active during the initial admission phase.
This increased test execution time somewhat, so I also had to bump the
queue wait timeout for the admission stress tests (they assume that
queries don't time out in the queue).

Testing:
Ran the test under debug, release and ASAN builds, i.e.

  impala-py.test tests/custom_cluster/test_admission_controller.py \
--workload_exploration_strategy="functional-query:exhaustive"

I looped the mem_limit test for a while to confirm it didn't reproduce
(it reproduced reliably every 2-3 iterations before this fix).

It still reproduces every 5-10 runs with exhaustive+release, so I
need to do further work to make it more robust.

Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
---
M fe/src/test/resources/llama-site-test2.xml
M tests/custom_cluster/test_admission_controller.py
2 files changed, 116 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/8631/5
--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8631 )

Change subject: IMPALA-6227: deflake admission stress tests
..


Patch Set 4: Code-Review+2

(1 comment)

Seems like an improvement if you want to commit it here even if it doesn't 
address all flakiness.

http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py@698
PS4, Line 698: every
 :   # second
shouldn't that say "at least every second" or something?



--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 22 Nov 2017 23:14:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8631 )

Change subject: IMPALA-6227: deflake admission stress tests
..


Patch Set 3:

(1 comment)

Just to summarise where this is at, it's a lot less flaky but I'm still seeing 
occasional flakes that I think are the result of the test seeing inconsistent 
metrics (i.e. admitted, queued, dequeued, etc are out of sync).

http://gerrit.cloudera.org:8080/#/c/8631/3/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/8631/3/tests/custom_cluster/test_admission_controller.py@706
PS3, Line 706: sleep(0.2)
> what's that about? does the QUERY_TIMEOUT comment above need updating?
I was experimenting with ways to reduce test runtime and reduce the chances of 
it hitting various timeouts.

Factored out a constant to make the relationship clearer.



--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 22 Nov 2017 23:03:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Hello Bikramjeet Vig, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8631

to look at the new patch set (#4).

Change subject: IMPALA-6227: deflake admission stress tests
..

IMPALA-6227: deflake admission stress tests

The problem was that, during the initial admission decision phase, some
queries were initially queued then dequeued once memory came available.
All of the accounting in the test implicitly relies on queries not being
dequeued until queries are later explicitly ended, so if this happened,
the test broke in multiple subtle ways.

This happened because the query only scanned a small number of
rows, which could be all buffered on the receiver side of the
exchange even before the client fetched any rows from the coordinator.
This means that the reserved memory on some backends could increase
then decrease during the initial admission phase, resulting in a
query being queued then dequeued.

The fix is to increase the number of rows returned by the query so that
all fragments remain active during the initial admission phase.
This increased test execution time somewhat, so I also had to bump the
queue wait timeout for the admission stress tests (they assume that
queries don't time out in the queue).

Testing:
Ran the test under debug, release and ASAN builds, i.e.

  impala-py.test tests/custom_cluster/test_admission_controller.py \
--workload_exploration_strategy="functional-query:exhaustive"

I looped the mem_limit test for a while to confirm it didn't reproduce
(it reproduced reliably every 2-3 iterations before this fix).

It still reproduces every 5-10 runs with exhaustive+release, so I
need to do further work to make it more robust.

Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
---
M fe/src/test/resources/llama-site-test2.xml
M tests/custom_cluster/test_admission_controller.py
2 files changed, 114 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/8631/4
--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Hello Bikramjeet Vig, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8631

to look at the new patch set (#2).

Change subject: IMPALA-6227: deflake admission stress tests
..

IMPALA-6227: deflake admission stress tests

The problem was that, during the initial admission decision phase, some
queries were initially queued then dequeued once memory came available.
All of the accounting in the test implicitly relies on queries not being
dequeued until queries are later explicitly ended, so if this happened,
the test broke in multiple subtle ways.

This happened because the query only scanned a small number of
rows, which could be all buffered on the receiver side of the
exchange even before the client fetched any rows from the coordinator.
This means that the reserved memory on some backends could increase
then decrease during the initial admission phase, resulting in a
query being queued then dequeued.

The fix is to increase the number of rows returned by the query so that
all fragments remain active during the initial admission phase.
This increased test execution time somewhat, so I also had to bump the
queue wait timeout for the admission stress tests (they assume that
queries don't time out in the queue).

Testing:
Ran the test under debug, release and ASAN builds, i.e.

  impala-py.test tests/custom_cluster/test_admission_controller.py \
--workload_exploration_strategy="functional-query:exhaustive"

I looped the mem_limit test for a while to confirm it didn't reproduce
(it reproduced reliably every 2-3 iterations before this fix).

I will try looping it a bit more under different build types
concurrently with the review.

Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
---
M fe/src/test/resources/llama-site-test2.xml
M tests/custom_cluster/test_admission_controller.py
2 files changed, 89 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/8631/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 


[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests

2017-11-22 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8631 )

Change subject: IMPALA-6227: deflake admission stress tests
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8631/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8631/1//COMMIT_MSG@11
PS1, Line 11: All of the accounting in the test implicitly relies on queries 
not being
: dequeued until queries are later explicitly ended, so if this 
happened,
: the test broke in multiple subtle ways.
> and this assumption is no longer true because of the final change for IMPAL
I don't think it was ever true for the mem_limit tests. Even before IMPALA-1575 
it depended implicitly on the memory staying reserved on all backends.

>From what I can tell, the bug was always there, it was a tweak to the 
>statestore frequency that threw it out of balance.


http://gerrit.cloudera.org:8080/#/c/8631/1/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/8631/1/tests/custom_cluster/test_admission_controller.py@705
PS1, Line 705: b4
> where is that used?
Leftover debugging code that I missed removing.


http://gerrit.cloudera.org:8080/#/c/8631/1/tests/custom_cluster/test_admission_controller.py@792
PS1, Line 792: amount of time
> is this actually timing sensitive, or is it that as long as we don't fetch
It shouldn't be timing sensitive. Updated the comment.



--
To view, visit http://gerrit.cloudera.org:8080/8631
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Gerrit-Change-Number: 8631
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 22 Nov 2017 22:45:50 +
Gerrit-HasComments: Yes