[jira] [Comment Edited] (FLINK-26400) Release Testing: Explicit shutdown signalling from TaskManager to JobManager

2022-03-02 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500466#comment-17500466
 ] 

Zhu Zhu edited comment on FLINK-26400 at 3/3/22, 2:44 AM:
--

Here's what I see on the page: "Limited integration with Flink’s Web UI: 
Adaptive Scheduler allows that a job’s parallelism can change over its 
lifetime. The web UI only shows the current parallelism of the job."

It seems the two problems listed above are not covered by this known 
limitation, so I think they need to be fixed. But I agree that they are not 
blockers for 1.15, because the problems have existed for several versions and 
are not blocking users.

Thanks for updating FLINK-22243. I have attached a picture to show problem #1.


> Release Testing: Explicit shutdown signalling from TaskManager to JobManager
> 
>
> Key: FLINK-26400
> URL: https://issues.apache.org/jira/browse/FLINK-26400
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.15.0
>Reporter: Niklas Semmler
>Assignee: Zhu Zhu
>Priority: Blocker
>  Labels: release-testing
> Fix For: 1.15.0
>
> Attachments: errors_on_opening_job_page_when_job_gets_no_resources.png
>
>
> FLINK-25277 introduces explicit signalling between a TaskManager and the 
> JobManager when the TaskManager shuts down. This reduces the time it takes 
> for a reactive cluster to down-scale & restart.
>  
> *Setup*
>  # Add the following lines to your Flink configuration to enable reactive mode:
> {code}
> taskmanager.host: localhost # a workaround
> scheduler-mode: reactive
> restart-strategy: fixeddelay
> restart-strategy.fixed-delay.attempts: 100
> {code}
>  # Create a “usrlib” folder and place the TopSpeedWindowing jar into it
> {code:bash}
> $ mkdir usrlib
> $ cp examples/streaming/TopSpeedWindowing.jar usrlib/
> {code}
>  # Start the job 
> {code:bash}
> $ bin/standalone-job.sh start  --main-class 
> org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
> {code}
>  # Start three task managers
> {code:bash}
> $ bin/taskmanager.sh start
> $ bin/taskmanager.sh start
> $ bin/taskmanager.sh start
> {code}
>  # Wait for the job to stabilize. The log file should show that three tasks 
> start for every operator.
> {code}
>  GlobalWindows -> Sink: Print to Std. Out (3/3) 
> (d10339d5755d07f3d9864ed1b2147af2) switched from INITIALIZING to 
> RUNNING.{code}
> *Test*
> Stop one taskmanager
> {code:bash}
> $ bin/taskmanager.sh stop
> {code}
> Success condition: You should see that the job cancels and re-runs after a 
> few seconds. In the logs you should find a line with the text “The 
> TaskExecutor is shutting down”.
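> A quick way to check the logs for this line (the default log/ directory is 
> assumed; the exact log file names depend on your setup):
> {code:bash}
> $ grep "The TaskExecutor is shutting down" log/*.log
> {code}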
> *Teardown*
> Stop all taskmanagers and the jobmanager:
> {code:bash}
> $ bin/standalone-job.sh stop
> $ bin/taskmanager.sh stop-all
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-26400) Release Testing: Explicit shutdown signalling from TaskManager to JobManager

2022-03-02 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1740#comment-1740
 ] 

Zhu Zhu edited comment on FLINK-26400 at 3/2/22, 9:09 AM:
--

I have done the test and the result looks good. The stopped TaskManager is 
quickly detected by the JobManager, which then triggers a restart. I also tried 
directly killing a TaskManager with `kill` (i.e. SIGTERM), and that works well too.
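
For reference, a rough sketch of the SIGTERM check (the `jps` lookup is just 
one assumed way to find the TaskManager PID; the PID below is example output):
{code:bash}
# List JVM processes and pick a TaskManager (its main class is TaskManagerRunner)
$ jps | grep TaskManagerRunner
12345 TaskManagerRunner
# kill sends SIGTERM by default, causing a graceful TaskManager shutdown
$ kill 12345
{code}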

Thanks for this improvement [~nsemmler]! I think it is very useful even for 
non-reactive mode: taking a problematic machine out of service is very common 
in production, and in our experience it is a common cause of Flink streaming 
job failures. This improvement speeds up error detection, and hence job 
recovery.

I noticed two problems, which are out of the scope of this test though.
1. When there are no resources and the job is in CREATED state, I get an error 
"Job failed during initialization of JobManager" when trying to open the job 
page, and cannot see the job topology or other job information (a minimal way 
to reproduce this is sketched after this list).
2. The parallelisms of the graph nodes shown in the web UI are not updated when 
job vertex parallelisms change. I guess the root cause is that the JSON plan is 
not regenerated when job vertex parallelisms change; we encountered the same 
issue when developing the AdaptiveBatchScheduler. As a reference, our solution 
can be found 
[here|https://github.com/apache/flink/blob/152ad4fc14920372076c0004793c179141ae10c7/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java#L224].
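
A minimal way to reproduce problem #1 (assuming the default web UI port 8081; 
the REST call is only an alternative to opening the job page in a browser):
{code:bash}
# Start the standalone job but do NOT start any TaskManager, so the job
# gets no resources and stays in CREATED state
$ bin/standalone-job.sh start --main-class org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
# Open http://localhost:8081 and click the job, or list the jobs and their
# states via the REST API:
$ curl http://localhost:8081/jobs/overview
{code}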

