[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531149#comment-16531149
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/6216


> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531118#comment-16531118
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/6216
  
merging.


> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531072#comment-16531072
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/6216#discussion_r199738568
  
--- Diff: 
flink-end-to-end-tests/test-scripts/test_queryable_state_restart_tm.sh ---
@@ -85,20 +90,23 @@ function run_test() {
 exit 1
 fi
 
-local 
current_num_checkpoints=current_num_checkpoints$(get_completed_number_of_checkpoints
 ${JOB_ID})
-
 kill_random_taskmanager
 
 latest_snapshot_count=$(cat $FLINK_DIR/log/*out* | grep "on snapshot" 
| tail -n 1 | awk '{print $4}')
 echo "Latest snapshot count was ${latest_snapshot_count}"
 
-sleep 65 # this is a little longer than the heartbeat timeout so that 
the TM is gone
+# wait until the TM loss was detected
+wait_for_job_state_transition ${JOB_ID} "RESTARTING" "CREATED"
 
 start_and_wait_for_tm
 
+wait_job_running ${JOB_ID}
+
+local current_num_checkpoints="$(get_completed_number_of_checkpoints 
${JOB_ID})"
--- End diff --

So when I ran this test locally it frequently occurred that we never 
actually waited for checkpoints to complete; the number of checkpoint that we 
waited for had already occurred and we exited early.
The job just switches to running, and right away we got 3 completed 
checkpoints?
This was a bit, well, _odd_. 

In the previous version the checkpoint count could further increase while 
we shut down the TM.
Technically there's no guarantee how up-to-date the result if 
`get_completed_number_of_checkpoints` or how long `kill_random_taskmanager` 
actually takes, so we may have fulfilled the checkpoint count condition before 
the job even restarts. It's admittedly unlikely.

This change simply guarantees that the job actually completes N checkpoints 
after it was restarted, and just serves to eliminate doubts about the 
correctness of the test.


> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531040#comment-16531040
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user florianschmidt1994 commented on the issue:

https://github.com/apache/flink/pull/6216
  
> (Did you start a review, but wrote your comment separately? Then your 
review would still be in progress!)

Yes :D I submitted it now

>The issues you identified are certainly valid, but I'm wondering if it 
really makes sense to invest time into this now.
>There've been discussions about writing a python/java based framework for 
the tests (and I've already started tinkering on a java version); so any 
significant changes we make to the bash-scripts might be subsumed soon™.

Alright, then let's leave it this way and we can still take care of it 
should we run into problems with this in the future



> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531034#comment-16531034
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user florianschmidt1994 commented on a diff in the pull request:

https://github.com/apache/flink/pull/6216#discussion_r199719951
  
--- Diff: 
flink-end-to-end-tests/test-scripts/test_queryable_state_restart_tm.sh ---
@@ -85,20 +90,23 @@ function run_test() {
 exit 1
 fi
 
-local 
current_num_checkpoints=current_num_checkpoints$(get_completed_number_of_checkpoints
 ${JOB_ID})
-
 kill_random_taskmanager
 
 latest_snapshot_count=$(cat $FLINK_DIR/log/*out* | grep "on snapshot" 
| tail -n 1 | awk '{print $4}')
 echo "Latest snapshot count was ${latest_snapshot_count}"
 
-sleep 65 # this is a little longer than the heartbeat timeout so that 
the TM is gone
+# wait until the TM loss was detected
+wait_for_job_state_transition ${JOB_ID} "RESTARTING" "CREATED"
 
 start_and_wait_for_tm
 
+wait_job_running ${JOB_ID}
+
+local current_num_checkpoints="$(get_completed_number_of_checkpoints 
${JOB_ID})"
--- End diff --

Why did you move this from `before killing the TM` to `after having the TM 
restarted again` ?


> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531030#comment-16531030
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/6216
  
@florianschmidt1994 I don't see the comment you're referring to :(
(Did you start a review, but wrote your comment separately? Then your 
review would still be in progress!)

The issues you identified are certainly valid, but I'm wondering if it 
really makes sense to invest time into this now.
There've been discussions about writing a python/java based framework for 
the tests (and I've already started tinkering on a java version); so any 
_significant_ changes we make to the bash-scripts might be subsumed soon.


> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531006#comment-16531006
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user florianschmidt1994 commented on the issue:

https://github.com/apache/flink/pull/6216
  
Thanks @zentol, I have one remark (see above), besides that looks good to 
me!

Additionally I had some ideas that came to mind that I think we could 
discuss:
- We have a common pattern of `wait_for_sth` functions, that either
  - get stuck in a loop for ever if the desired event doesn't happen (I 
think `wait_for_job_state_transition` also behaves like that, right?)
  - or iterate a fixed number of times and then continue execution, whereas 
instead they should fail.

I think we should add an issue for that to refactor that over all the tests 
to have consistent and useful behaviour
- Also I think that we could have the backup config and revert config as 
part of the test runner  and always do that, so we avoid running into a 
corrupted flink-dist if tests don't behave correctly?

What do you think!



> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-07-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530061#comment-16530061
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

Github user yanghua commented on the issue:

https://github.com/apache/flink/pull/6216
  
+1


> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9674) Remove 65s sleep in QueryableState E2E test

2018-06-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524924#comment-16524924
 ] 

ASF GitHub Bot commented on FLINK-9674:
---

GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/6216

[FLINK-9674][tests] Replace hard-coded sleeps in QS E2E test 

## What is the purpose of the change

This PR replaces a hard-coded sleep in a QueryableState end-to-end test. 
The test was sleeping for 65 seconds after the TM was restarted, to wait until 
the restarted.

Instead we now first wait for the job restart procedure to kick in (using a 
new method to check for state transitions) and then waiting for the job to run.

Additionally
* reduce `heartbeat.timeout` to speed up TM loss detection
* fix `common#wait_job_running` returning true for scheduled jobs, we now 
pass the `-r` flag
* simplify jar paths in QS test (now relative to END_TO_END_DIR)
* change `test-runner-common#cleanup` to use existing 
`common#clean_[log|stdout]_files` functions

## Verifying this change

* manually verified

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zentol/flink 9674

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6216.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6216


commit b6ab955173e1813b9903bf1fb2eb1345ff17abd3
Author: zentol 
Date:   2018-06-27T10:58:49Z

[hotfix][tests] Reuse existing functions for cleaning logs

commit b4866c55bf64710aa5f21d214c86caa4702dd73e
Author: zentol 
Date:   2018-06-27T10:59:16Z

[hotfix][tests] Simplify jar paths to QS tests

commit 2e4788f5c6840780c4ff0152a91435087802a14e
Author: zentol 
Date:   2018-06-27T11:01:06Z

[FLINK-9674][tests] Replace hard-coded sleeps in QS E2E test




> Remove 65s sleep in QueryableState E2E test
> ---
>
> Key: FLINK-9674
> URL: https://issues.apache.org/jira/browse/FLINK-9674
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
>
> The {{test_queryable_state_restart_tm.sh}} kills a taskmanager, waits for the 
> loss to be noticed, starts a new tm and waits for the job to continue.
> {code}
> kill_random_taskmanager
> [...]
> sleep 65 # this is a little longer than the heartbeat timeout so that the TM 
> is gone
> start_and_wait_for_tm
> {code}
> Instead of waiting for a fixed amount of time that is tied to some config 
> value we should wait for a specific event, like the job being canceled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)