[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-27 Thread Rui Fan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17639802#comment-17639802
 ] 

Rui Fan commented on FLINK-30184:
-

Hi [~xtsong], thanks for your explanation.

That sounds reasonable. I will close this JIRA.

> Save TM/JM thread stack periodically
> 
>
> Key: FLINK-30184
> URL: https://issues.apache.org/jira/browse/FLINK-30184
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Web Frontend
> Reporter: Rui Fan
> Priority: Major
> Fix For: 1.17.0
>
>
> After FLINK-14816, FLINK-25398 and FLINK-25372, Flink users can view the
> thread stacks of the TM/JM in the Flink WebUI.
> This helps users find out why a Flink job is stuck or why processing is
> slow, and it is very useful for troubleshooting.
> However, sometimes Flink tasks get stuck or process slowly, but by the time
> the user troubleshoots the problem, the job has already recovered. It is then
> difficult to find out what happened to the job at the time and why it was slow.
>
> So, could we periodically save the thread stacks of the TM or JM in the TM log
> directory?
> We could define some configuration options, for example:
> cluster.thread-dump.interval=1min
> cluster.thread-dump.cleanup-time=48 hours
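
To make the intended semantics of the two proposed options concrete, here is a
minimal sketch in Python of how an interval and a cleanup time could work
together. The option names above were only proposed in this issue and are not
Flink configuration options. The file naming scheme and all function names
below are hypothetical, chosen for illustration only.

```python
import os
import time

# Hypothetical values mirroring the two options proposed in the issue.
# Neither option exists in Flink.
THREAD_DUMP_INTERVAL_SECONDS = 60            # cluster.thread-dump.interval=1min
THREAD_DUMP_CLEANUP_SECONDS = 48 * 60 * 60   # cluster.thread-dump.cleanup-time=48 hours


def dump_file_name(epoch_seconds):
    """Build a sortable file name (hypothetical naming scheme) for one dump."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(epoch_seconds))
    return "thread-dump-" + stamp + ".txt"


def save_dump(directory, dump_text, now):
    """Write one dump into `directory` and return the path of the new file."""
    path = os.path.join(directory, dump_file_name(now))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(dump_text)
    return path


def remove_expired_dumps(directory, now, max_age_seconds=THREAD_DUMP_CLEANUP_SECONDS):
    """Delete dump files whose modification time is older than the retention window."""
    removed = []
    for name in sorted(os.listdir(directory)):
        if not (name.startswith("thread-dump-") and name.endswith(".txt")):
            continue  # leave ordinary log files alone
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

A scheduler would call `save_dump` once every `THREAD_DUMP_INTERVAL_SECONDS`
and then `remove_expired_dumps`, so that at most about 48 hours of dumps stay
on disk.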



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17639775#comment-17639775
 ] 

Xintong Song commented on FLINK-30184:
--

[~fanrui], sorry for the late response.

I agree with [~wangyang0918] that this is probably more suitable for an 
external service that manages / monitors Flink.

Thread dumps are for debugging and should not be activated constantly given the 
performance impact. Flink already offers REST APIs for capturing the thread 
stacks of the 
[jobmanager|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobmanager-thread-dump] 
and 
[taskmanager|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#taskmanagers-taskmanagerid-thread-dump]. 
It should be easy for an external monitoring system to capture the dumps when 
the job is detected to be slow.
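
As a rough illustration of the external approach, the sketch below fetches a
thread dump from the two REST endpoints linked above using only the Python
standard library. The endpoint paths come from the linked documentation. The
response field names (`threadInfos`, `stringifiedThreadInfo`) are my
understanding of the response schema and should be verified against your Flink
version. The function names and the base URL in the usage note are
hypothetical.

```python
import json
import urllib.parse
import urllib.request


def thread_dump_url(rest_base_url, taskmanager_id=None):
    """Return the thread-dump endpoint for the JobManager or one TaskManager."""
    base = rest_base_url.rstrip("/")
    if taskmanager_id is None:
        return base + "/jobmanager/thread-dump"
    # Percent-encode the ID so characters such as ':' are safe in the path.
    encoded_id = urllib.parse.quote(taskmanager_id, safe="")
    return base + "/taskmanagers/" + encoded_id + "/thread-dump"


def format_thread_dump(payload):
    """Join the per-thread entries of a response into one text block.

    Assumes the response shape
    {"threadInfos": [{"threadName": ..., "stringifiedThreadInfo": ...}]};
    verify this against your Flink version.
    """
    infos = payload.get("threadInfos", [])
    return "\n".join(info.get("stringifiedThreadInfo", "").rstrip() for info in infos)


def fetch_thread_dump(rest_base_url, taskmanager_id=None, timeout_seconds=10):
    """Fetch one thread dump over HTTP and return it as text."""
    url = thread_dump_url(rest_base_url, taskmanager_id)
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return format_thread_dump(json.load(response))
```

A monitoring system could call, for example,
`fetch_thread_dump("http://localhost:8081")` whenever it detects that the job
is slow, and store the result next to its own alerts. The address is an
assumption; use the REST address of your cluster.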



[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-25 Thread Rui Fan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638560#comment-17638560
 ] 

Rui Fan commented on FLINK-30184:
-

Hi [~wangyang0918], thanks for your feedback.

In fact, this feature is useful for troubleshooting, and I know that some 
companies do this with YARN.

However, the Apache version of YARN doesn't have this feature, and many 
companies don't maintain their own internal YARN version. So I'm not sure 
whether this should be done on the Flink side.



[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-24 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638310#comment-17638310
 ] 

Yang Wang commented on FLINK-30184:
---

I lean towards having this done outside of Flink.



[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-24 Thread Yun Gao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638198#comment-17638198
 ] 

Yun Gao commented on FLINK-30184:
-

Thanks [~fanrui] for the explanation! Now I understand the issue.



[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-24 Thread Rui Fan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638192#comment-17638192
 ] 

Rui Fan commented on FLINK-30184:
-

Hi [~gaoyunhaii], thanks for your reply. I'm sorry, I may not have expressed 
my thoughts clearly.

As I understand it, the Flame Graph can only display the current stack.

What I mean is: suppose the Flink job lag is huge at 5am. The user did not 
troubleshoot the problem at that time and only started once working hours 
began. At that point, metrics can only show which task was slow. They cannot 
show where the task was stuck at 5am or why it was slow. The thread stack from 
5am can help users analyze where the task was stuck.

In general, historical thread stacks can tell the user what the Flink job was 
doing at each minute.



[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-24 Thread Yun Gao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638166#comment-17638166
 ] 

Yun Gao commented on FLINK-30184:
-

Hi [~fanrui], perhaps 
[FlameGraph|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/flame_graphs/] 
could provide this functionality?



[jira] [Commented] (FLINK-30184) Save TM/JM thread stack periodically

2022-11-23 Thread Rui Fan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638128#comment-17638128
 ] 

Rui Fan commented on FLINK-30184:
-

Hi [~xtsong], please take a look when you have time. If it makes sense, please 
assign it to me. Thanks!
