[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-07 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340608#comment-17340608
 ] 

Robert Metzger commented on FLINK-22406:


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17625=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=f508e270-48d6-5f1e-3138-42a17e0714f0

> Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()
> -
>
> Key: FLINK-22406
> URL: https://issues.apache.org/jira/browse/FLINK-22406
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.13.0, 1.14.0
>Reporter: Stephan Ewen
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> The test is stalling on Azure CI.
> https://dev.azure.com/sewen0794/Flink/_build/results?buildId=292=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4865



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339543#comment-17339543
 ] 

Till Rohrmann commented on FLINK-22406:
---

Another instance: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17540=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Dawid Wysakowicz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339467#comment-17339467
 ] 

Dawid Wysakowicz commented on FLINK-22406:
--

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17552=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=10983



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Dawid Wysakowicz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339466#comment-17339466
 ] 

Dawid Wysakowicz commented on FLINK-22406:
--

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17552=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=f508e270-48d6-5f1e-3138-42a17e0714f0=5327



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Dawid Wysakowicz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339462#comment-17339462
 ] 

Dawid Wysakowicz commented on FLINK-22406:
--

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17558=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=f508e270-48d6-5f1e-3138-42a17e0714f0=5673



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Dawid Wysakowicz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339455#comment-17339455
 ] 

Dawid Wysakowicz commented on FLINK-22406:
--

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17543=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=f508e270-48d6-5f1e-3138-42a17e0714f0=5667

> Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()
> -
>
> Key: FLINK-22406
> URL: https://issues.apache.org/jira/browse/FLINK-22406
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.13.0
>Reporter: Stephan Ewen
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> The test is stalling on Azure CI.
> https://dev.azure.com/sewen0794/Flink/_build/results?buildId=292=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4865





[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Dawid Wysakowicz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339454#comment-17339454
 ] 

Dawid Wysakowicz commented on FLINK-22406:
--

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17538=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=10541



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Dawid Wysakowicz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339452#comment-17339452
 ] 

Dawid Wysakowicz commented on FLINK-22406:
--

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17539=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=f508e270-48d6-5f1e-3138-42a17e0714f0=5316



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-05-05 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339436#comment-17339436
 ] 

Roman Khachatryan commented on FLINK-22406:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17544=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=f508e270-48d6-5f1e-3138-42a17e0714f0



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-04-27 Thread Guowei Ma (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334439#comment-17334439
 ] 

Guowei Ma commented on FLINK-22406:
---

Another case:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17316=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=10889

> Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()
> -
>
> Key: FLINK-22406
> URL: https://issues.apache.org/jira/browse/FLINK-22406
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.13.0
>Reporter: Stephan Ewen
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
>
> The test is stalling on Azure CI.
> https://dev.azure.com/sewen0794/Flink/_build/results?buildId=292=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4865





[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-04-23 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330479#comment-17330479
 ] 

Chesnay Schepler commented on FLINK-22406:
--

I managed to get the debug logs for a failed run, confirming my theory. The 
test starts out with 2 TMs, each with 2 slots. In total, the test runs 3 jobs: 
the first with p=2 after the first 2 slots arrive, the second with p=4 after 
the next 2 slots arrive, and a final one with p=2 again after we shut down a TM.

The problem is caused by the first job, coupled with how the test counts active 
subtasks.

The first job is effectively canceled immediately after being deployed, so the 
cancellation message can be processed before the task deployment. The 
cancellation fails because nothing is deployed (yet), and the JM marks the task 
as canceled and proceeds with a restart. The task deployment arrives later and 
is processed as usual. This task sticks around until the task initialization is 
complete and the task transitions into a running state, at which point the 
corresponding state update is rejected by the JM.
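
The reordering described above can be illustrated with a toy model. This is 
only a sketch, not Flink code: StrayTaskSimulation, ToyTaskExecutor and the 
Message enum are hypothetical names, and the model ignores the later state 
update that the JM rejects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the race; not Flink code. All names are hypothetical. */
final class StrayTaskSimulation {

    enum Message { DEPLOY, CANCEL }

    /** Minimal stand-in for a task executor that tracks live execution attempts. */
    static final class ToyTaskExecutor {
        private final Set<String> liveAttempts = new HashSet<>();

        /** Applies a message; returns false when a cancel targets an unknown attempt. */
        boolean handle(Message message, String attemptId) {
            if (message == Message.DEPLOY) {
                liveAttempts.add(attemptId);
                return true;
            }
            // CANCEL only succeeds if the attempt has already been deployed.
            return liveAttempts.remove(attemptId);
        }

        int liveAttemptCount() {
            return liveAttempts.size();
        }
    }

    /** Processes the messages in the given order and reports how many attempts stay live. */
    static int liveAttemptsAfter(List<Message> order) {
        ToyTaskExecutor executor = new ToyTaskExecutor();
        for (Message message : order) {
            executor.handle(message, "attempt-0");
        }
        return executor.liveAttemptCount();
    }

    public static void main(String[] args) {
        // Expected order: the task is deployed, then canceled, so nothing stays behind.
        System.out.println(liveAttemptsAfter(Arrays.asList(Message.DEPLOY, Message.CANCEL)));
        // Reordered: the cancel finds nothing, and the late deployment leaves a stray task.
        System.out.println(liveAttemptsAfter(Arrays.asList(Message.CANCEL, Message.DEPLOY)));
    }
}
```

In this model the in-order case leaves 0 live attempts and the reordered case 
leaves 1, which is the stray task that inflates the count of active subtasks.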

After an offline discussion with [~trohrmann], we concluded that we could fix 
this by not checking what is actually deployed and instead polling the JM for 
the current parallelism.
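
A minimal sketch of such a polling loop, under stated assumptions: 
ParallelismPoller and waitForParallelism are hypothetical names, and the 
IntSupplier stands in for whatever query the test ends up issuing against the 
JM. It is not the actual fix.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

/** Hypothetical polling helper; the supplier stands in for a query against the JM. */
final class ParallelismPoller {

    /**
     * Polls until the supplier reports the expected parallelism or the timeout elapses.
     * Returns true on success, false on timeout or interruption.
     */
    static boolean waitForParallelism(
            IntSupplier currentParallelism, int expected, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (currentParallelism.getAsInt() == expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Fake JM view: reports p=2 for the first two polls, then p=4.
        int[] polls = {0};
        boolean reached = waitForParallelism(
                () -> ++polls[0] >= 3 ? 4 : 2, 4, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(reached);
    }
}
```

Because the loop only looks at the value reported by the supplier, a stray task 
that is still running on a TM does not affect the result, and the deadline 
keeps the test from stalling indefinitely.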



[jira] [Commented] (FLINK-22406) Unstable test ReactiveModeITCase.testScaleDownOnTaskManagerLoss()

2021-04-22 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327344#comment-17327344
 ] 

Chesnay Schepler commented on FLINK-22406:
--

There are a few odd things in the logs. It seems like the JM is prematurely 
moving tasks into a canceled state.
{code:java}
23:47:20,274 INFO  o.a.f.r.executiongraph.ExecutionGraph   [] - Source: Custom Source -> Sink: Unnamed (2/2) (0d91787b2ba65cd0f259be619b293b96) switched from CREATED to DEPLOYING.
23:47:20,274 INFO  o.a.f.r.executiongraph.ExecutionGraph   [] - Source: Custom Source -> Sink: Unnamed (2/2) (0d91787b2ba65cd0f259be619b293b96) switched from DEPLOYING to CANCELING.
23:47:20,277 INFO  o.a.f.r.executiongraph.ExecutionGraph   [] - Source: Custom Source -> Sink: Unnamed (2/2) (0d91787b2ba65cd0f259be619b293b96) switched from CANCELING to CANCELED.
23:47:20,282 INFO  o.a.f.r.taskexecutor.TaskExecutor   [] - Received task Source: Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96), deploy into slot with allocation id 48a192cd6be4f34599cac87ad5d8caba.
23:47:20,287 INFO  o.a.f.r.taskmanager.Task   [] - Source: Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96) switched from CREATED to DEPLOYING.
23:47:20,296 INFO  o.a.f.r.taskmanager.Task   [] - Source: Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96) switched from DEPLOYING to INITIALIZING.
23:47:20,327 WARN  o.a.f.r.taskmanager.Task   [] - Source: Custom Source -> Sink: Unnamed (2/2)#0 (0d91787b2ba65cd0f259be619b293b96) switched from INITIALIZING to FAILED with failure cause: org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution attempt 0d91787b2ba65cd0f259be619b293b96 was not found.
{code}
This doesn't necessarily explain the issue, but with a stray task hanging 
around for longer than we expect, there is now the possibility that the number 
of active instances is 3 after the downscaling has concluded. If the test 
thread enters the waiting loop at this time, it will never exit, because we 
don't notify the thread when instances shut down. This is entirely theoretical, 
but it is the only explanation I can come up with that could cause the test to 
get stuck.
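
For illustration, a counter that wakes waiters on shutdown as well as on 
startup could look roughly like the sketch below. InstanceCounter and its 
methods are hypothetical and are not the test's actual code; the timeout is an 
extra safeguard assumed here, not something taken from the test.

```java
/** Hypothetical instance counter; not the test's actual code. */
final class InstanceCounter {
    private final Object lock = new Object();
    private int active;

    void instanceStarted() {
        synchronized (lock) {
            active++;
            lock.notifyAll();
        }
    }

    void instanceStopped() {
        synchronized (lock) {
            active--;
            // Wake waiters on shutdown too, so a waiter that saw a stale count re-checks.
            lock.notifyAll();
        }
    }

    /** Waits until exactly `expected` instances are active; false on timeout or interrupt. */
    boolean awaitCount(int expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (active != expected) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return false;
                }
                try {
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        InstanceCounter counter = new InstanceCounter();
        // Two expected subtasks plus one stray task.
        counter.instanceStarted();
        counter.instanceStarted();
        counter.instanceStarted();
        // The stray task goes away on another thread while the test thread waits.
        Thread stopper = new Thread(counter::instanceStopped);
        stopper.start();
        System.out.println(counter.awaitCount(2, 5_000));
    }
}
```

Without the notifyAll() call in instanceStopped() and without a deadline, a 
waiter that observed 3 active instances would have no reason to wake up again 
once the stray task disappears.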
