[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2023-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-20886:
---
Labels: auto-deprioritized-major auto-deprioritized-minor 
pull-request-available stale-assigned usability  (was: auto-deprioritized-major 
auto-deprioritized-minor stale-assigned usability)

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Assignee: Zakelly Lan
> Priority: Minor
> Labels: auto-deprioritized-major, auto-deprioritized-minor,
> pull-request-available, stale-assigned, usability
>
> For debugging checkpoint timeouts, I was thinking about the following 
> addition to Flink:
> When a checkpoint times out and the async thread is still running, create a 
> thread dump [1] and either add this to the checkpoint stats, log it, or write 
> it out.
> This may help identify where the checkpoint is stuck (maybe a lock, could 
> also be in a third party lib like the FS connectors,...). It would give us 
> some insights into what the thread is currently doing.
> Limiting the scope of the threads would be nice but may not be possible in 
> the general case since additional threads (spawned by the FS connector lib, 
> or otherwise connected) may interact with the async thread(s) by e.g. going 
> through the same locks. Maybe we can reduce the thread dumps to all async 
> threads of the failed checkpoint + all threads that interact with it, e.g. 
> via locks?
> I'm also not sure whether the ability to have thread dumps or not should be 
> user-configurable (Could it contain sensitive information from other jobs if 
> you run a session cluster? Is that even relevant since we don't give 
> isolation guarantees anyway?). If it is configurable, it should be on by 
> default.
> [1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/
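
As a rough illustration of the idea in the description, the following Java sketch uses the JDK's ThreadMXBean to render a dump of all live threads when the asynchronous part of a checkpoint is still running at the moment the timeout fires. The class and method names (CheckpointTimeoutDump, dumpAllThreads, dumpIfStillRunning) are hypothetical and not part of Flink, and the Future standing in for the async checkpoint part is an assumption. Whether the result goes into the checkpoint stats, the log, or a file is left open, as in the ticket.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Optional;
import java.util.concurrent.Future;

/** Hypothetical helper, not a Flink class: dumps threads when a checkpoint times out. */
final class CheckpointTimeoutDump {

    private CheckpointTimeoutDump() {}

    /** Renders all live threads with full stack traces and, if blocked, the lock they wait on. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Request monitor/synchronizer details only where the JVM supports them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState());
            if (info.getLockName() != null) {
                sb.append(" on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                sb.append(" owned by \"").append(info.getLockOwnerName()).append('"');
            }
            sb.append('\n');
            // ThreadInfo.toString() truncates the stack, so print every frame explicitly.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /**
     * Intended to be called from the timeout handler: returns a dump only if the
     * async part (assumed to be exposed as a Future) has not finished yet.
     */
    static Optional<String> dumpIfStillRunning(Future<?> asyncPart) {
        return asyncPart.isDone() ? Optional.empty() : Optional.of(dumpAllThreads());
    }
}
```

A timeout handler could call dumpIfStillRunning(...) and pass the resulting text to whatever sink is chosen. Note that this dumps every thread in the JVM, which is exactly the scope concern raised in the description for session clusters.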



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2023-10-03 Thread Flink Jira Bot (Jira)


Flink Jira Bot updated FLINK-20886:
---
Labels: auto-deprioritized-major auto-deprioritized-minor stale-assigned 
usability  (was: auto-deprioritized-major auto-deprioritized-minor usability)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue is assigned but has not 
received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a 
comment updating the community on your progress.  If this issue is waiting on 
feedback, please consider this a reminder to the committer/reviewer. Flink is a 
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone 
else may work on it.


> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Assignee: Zakelly Lan
> Priority: Minor
> Labels: auto-deprioritized-major, auto-deprioritized-minor,
> stale-assigned, usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2023-09-03 Thread Zakelly Lan (Jira)


Zakelly Lan updated FLINK-20886:

Priority: Minor  (was: Not a Priority)

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Assignee: Zakelly Lan
> Priority: Minor
> Labels: auto-deprioritized-major, auto-deprioritized-minor,
> usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-11-07 Thread Flink Jira Bot (Jira)


Flink Jira Bot updated FLINK-20886:
---
  Labels: auto-deprioritized-major auto-deprioritized-minor usability  
(was: auto-deprioritized-major stale-minor usability)
Priority: Not a Priority  (was: Minor)

This issue was labeled "stale-minor" 7 days ago and has not received any 
updates so it is being deprioritized. If this ticket is actually Minor, please 
raise the priority and ask a committer to assign you the issue or revive the 
public discussion.


> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor,
> usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-10-29 Thread Flink Jira Bot (Jira)


Flink Jira Bot updated FLINK-20886:
---
Labels: auto-deprioritized-major stale-minor usability  (was: 
auto-deprioritized-major usability)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue has been marked as 
Minor but is unassigned and neither itself nor its Sub-Tasks have been updated 
for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is 
still Minor, please either assign yourself or give an update. Afterwards, 
please remove the label or in 7 days the issue will be deprioritized.


> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Minor
> Labels: auto-deprioritized-major, stale-minor, usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-04-29 Thread Flink Jira Bot (Jira)


Flink Jira Bot updated FLINK-20886:
---
Labels: auto-deprioritized-major usability  (was: stale-major usability)

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Major
> Labels: auto-deprioritized-major, usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-04-29 Thread Flink Jira Bot (Jira)


Flink Jira Bot updated FLINK-20886:
---
Priority: Minor  (was: Major)

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Minor
> Labels: auto-deprioritized-major, usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-04-22 Thread Flink Jira Bot (Jira)


Flink Jira Bot updated FLINK-20886:
---
Labels: stale-major usability  (was: usability)

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Major
> Labels: stale-major, usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-01-15 Thread Nico Kruber (Jira)


Nico Kruber updated FLINK-20886:

Labels: usability  (was: )

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Major
> Labels: usability


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-01-07 Thread Nico Kruber (Jira)


Nico Kruber updated FLINK-20886:

Description: 
For debugging checkpoint timeouts, I was thinking about the following addition 
to Flink:

When a checkpoint times out and the async thread is still running, create a 
thread dump [1] and either add this to the checkpoint stats, log it, or write 
it out.

This may help identify where the checkpoint is stuck (maybe a lock, could 
also be in a third party lib like the FS connectors,...). It would give us some 
insights into what the thread is currently doing.

Limiting the scope of the threads would be nice but may not be possible in the 
general case since additional threads (spawned by the FS connector lib, or 
otherwise connected) may interact with the async thread(s) by e.g. going 
through the same locks. Maybe we can reduce the thread dumps to all async 
threads of the failed checkpoint + all threads that interact with it, e.g. via 
locks?

I'm also not sure whether the ability to have thread dumps or not should be 
user-configurable (Could it contain sensitive information from other jobs if 
you run a session cluster? Is that even relevant since we don't give isolation 
guarantees anyway?). If it is configurable, it should be on by default.


[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/

  was:
For debugging checkpoint timeouts, I was thinking about the following addition 
to Flink:

When a checkpoint times out and the async thread is still running, create a 
threaddump [1] and either add this to the checkpoint stats, log it, or write it 
out.

This may help identify where the checkpoint is stuck (maybe a lock, could 
also be in a third party lib like the FS connectors,...). It would give us some 
insights into what the thread is currently doing.

Limiting the scope of the threads would be nice but may not be possible in the 
general case since additional threads (spawned by the FS connector lib, or 
otherwise connected) may interact with the async thread(s) by e.g. going 
through the same locks.


[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/
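
The interaction "through the same locks" mentioned above could, as one possible approach, be approximated with the lock-owner information the JDK exposes per thread. The sketch below starts from a set of seed thread IDs (assumed to be the async threads of the failed checkpoint) and repeatedly adds any thread that owns a lock a related thread is waiting for, or that is waiting for a lock a related thread owns. LockRelatedThreads and lockRelatedThreadIds are hypothetical names, not Flink API. The result only reflects threads that are blocked at the instant of the snapshot, so it is a heuristic rather than a complete answer to the scoping question.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not a Flink class: narrows a thread dump to lock-related threads. */
final class LockRelatedThreads {

    private LockRelatedThreads() {}

    /**
     * Returns the seed thread IDs plus every thread connected to them, directly or
     * transitively, through a "waits for a lock owned by" relation in one snapshot.
     */
    static Set<Long> lockRelatedThreadIds(Set<Long> seedIds) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Lock owner IDs are part of the basic snapshot; no monitor details needed.
        ThreadInfo[] infos = bean.dumpAllThreads(false, false);

        Set<Long> related = new HashSet<>(seedIds);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (ThreadInfo info : infos) {
                long waiter = info.getThreadId();
                long owner = info.getLockOwnerId(); // -1 if not waiting on an owned lock
                if (owner == -1) {
                    continue;
                }
                // Treat waiter/owner as an undirected edge and grow the set along it.
                if (related.contains(waiter) && related.add(owner)) {
                    changed = true;
                }
                if (related.contains(owner) && related.add(waiter)) {
                    changed = true;
                }
            }
        }
        return related;
    }
}
```

The returned IDs could then be passed to ThreadMXBean.getThreadInfo(long[], boolean, boolean) to produce a reduced dump. Threads that share a lock with the async thread but are not blocked at snapshot time would be missed, which matches the ticket's caveat that limiting the scope may not be possible in the general case.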


> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Major


[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts

2021-01-07 Thread Nico Kruber (Jira)


Nico Kruber updated FLINK-20886:

Affects Version/s: (was: 1.12.0)

> Add the option to get a threaddump on checkpoint timeouts
> -
>
> Key: FLINK-20886
> URL: https://issues.apache.org/jira/browse/FLINK-20886
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Nico Kruber
> Priority: Major
>
> For debugging checkpoint timeouts, I was thinking about the following 
> addition to Flink:
> When a checkpoint times out and the async thread is still running, create a 
> threaddump [1] and either add this to the checkpoint stats, log it, or write 
> it out.
> This may help identify where the checkpoint is stuck (maybe a lock, could 
> also be in a third party lib like the FS connectors,...). It would give us 
> some insights into what the thread is currently doing.
> Limiting the scope of the threads would be nice but may not be possible in 
> the general case since additional threads (spawned by the FS connector lib, 
> or otherwise connected) may interact with the async thread(s) by e.g. going 
> through the same locks.
> [1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/


