[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-20886:
---
    Labels: auto-deprioritized-major auto-deprioritized-minor pull-request-available stale-assigned usability  (was: auto-deprioritized-major auto-deprioritized-minor stale-assigned usability)

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Assignee: Zakelly Lan
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-deprioritized-minor,
> pull-request-available, stale-assigned, usability
>
> For debugging checkpoint timeouts, I was thinking about the following
> addition to Flink:
> When a checkpoint times out and the async thread is still running, create a
> thread dump [1] and either add this to the checkpoint stats, log it, or write
> it out.
> This may help identify where the checkpoint is stuck (maybe a lock, could
> also be in a third-party lib like the FS connectors, ...). It would give us
> some insights into what the thread is currently doing.
> Limiting the scope of the threads would be nice but may not be possible in
> the general case since additional threads (spawned by the FS connector lib,
> or otherwise connected) may interact with the async thread(s) by e.g. going
> through the same locks. Maybe we can reduce the thread dumps to all async
> threads of the failed checkpoint + all threads that interact with it, e.g.
> via locks?
> I'm also not sure whether the ability to have thread dumps or not should be
> user-configurable (Could it contain sensitive information from other jobs if
> you run a session cluster? Is that even relevant since we don't give
> isolation guarantees anyway?). If it is configurable, it should be on by
> default.
> [1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/

--
This message was sent by Atlassian Jira (v8.20.10#820010)
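
For readers unfamiliar with programmatic thread dumps, the step the issue describes (take a dump, then log it or write it out) can be sketched with the JDK's standard `java.lang.management` API. This is only an illustration under stated assumptions: the class `ThreadDumpSketch` and its methods are hypothetical names, not Flink code, and where a checkpoint-timeout handler would call it is left open, as it is in the issue.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper (not part of Flink): renders a JVM-wide thread dump as text. */
final class ThreadDumpSketch {

    private ThreadDumpSketch() {}

    /** Dumps all live threads, with lock details where the JVM supports them. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info != null) {
                out.append(format(info));
            }
        }
        return out.toString();
    }

    /** Formats one thread; ThreadInfo.toString() is avoided because it truncates the stack. */
    static String format(ThreadInfo info) {
        String newline = System.lineSeparator();
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(info.getThreadName()).append('"')
                .append(" id=").append(info.getThreadId())
                .append(" state=").append(info.getThreadState());
        if (info.getLockName() != null) {
            sb.append(" waiting on ").append(info.getLockName());
            if (info.getLockOwnerName() != null) {
                sb.append(" owned by \"").append(info.getLockOwnerName()).append('"');
            }
        }
        sb.append(newline);
        for (StackTraceElement frame : info.getStackTrace()) {
            sb.append("    at ").append(frame).append(newline);
        }
        return sb.append(newline).toString();
    }
}
```

A timeout handler could pass the result of `ThreadDumpSketch.dumpAllThreads()` to its logger or write it to a file. Because the dump covers every thread in the JVM, it also shows threads of other jobs in a session cluster, which is the sensitivity concern the description raises.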
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20886:
---
    Labels: auto-deprioritized-major auto-deprioritized-minor stale-assigned usability  (was: auto-deprioritized-major auto-deprioritized-minor usability)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue is assigned but has not received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a comment updating the community on your progress. If this issue is waiting on feedback, please consider this a reminder to the committer/reviewer. Flink is a very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone else may work on it.

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Assignee: Zakelly Lan
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-deprioritized-minor,
> stale-assigned, usability
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zakelly Lan updated FLINK-20886:
    Priority: Minor  (was: Not a Priority)

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Assignee: Zakelly Lan
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-deprioritized-minor,
> usability
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20886:
---
      Labels: auto-deprioritized-major auto-deprioritized-minor usability  (was: auto-deprioritized-major stale-minor usability)
    Priority: Not a Priority  (was: Minor)

This issue was labeled "stale-minor" 7 days ago and has not received any updates so it is being deprioritized. If this ticket is actually Minor, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor,
> usability

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20886:
---
    Labels: auto-deprioritized-major stale-minor usability  (was: auto-deprioritized-major usability)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor but is unassigned and neither itself nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor, usability

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20886:
---
    Labels: auto-deprioritized-major usability  (was: stale-major usability)

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major
>              Labels: auto-deprioritized-major, usability
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20886:
---
    Priority: Minor  (was: Major)

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Minor
>              Labels: auto-deprioritized-major, usability
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20886:
---
    Labels: stale-major usability  (was: usability)

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major
>              Labels: stale-major, usability
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nico Kruber updated FLINK-20886:
    Labels: usability  (was: )

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major
>              Labels: usability
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nico Kruber updated FLINK-20886:
    Description:
For debugging checkpoint timeouts, I was thinking about the following addition to Flink:

When a checkpoint times out and the async thread is still running, create a thread dump [1] and either add this to the checkpoint stats, log it, or write it out.

This may help identify where the checkpoint is stuck (maybe a lock, could also be in a third-party lib like the FS connectors, ...). It would give us some insights into what the thread is currently doing.

Limiting the scope of the threads would be nice but may not be possible in the general case since additional threads (spawned by the FS connector lib, or otherwise connected) may interact with the async thread(s) by e.g. going through the same locks. Maybe we can reduce the thread dumps to all async threads of the failed checkpoint + all threads that interact with it, e.g. via locks?

I'm also not sure whether the ability to have thread dumps or not should be user-configurable (Could it contain sensitive information from other jobs if you run a session cluster? Is that even relevant since we don't give isolation guarantees anyway?). If it is configurable, it should be on by default.

[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/

  was:
For debugging checkpoint timeouts, I was thinking about the following addition to Flink:

When a checkpoint times out and the async thread is still running, create a threaddump [1] and either add this to the checkpoint stats, log it, or write it out.

This may help identify where the checkpoint is stuck (maybe a lock, could also be in a third-party lib like the FS connectors, ...). It would give us some insights into what the thread is currently doing.

Limiting the scope of the threads would be nice but may not be possible in the general case since additional threads (spawned by the FS connector lib, or otherwise connected) may interact with the async thread(s) by e.g. going through the same locks.

[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major
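
The scoping idea added in this revision (dump only the checkpoint's async threads plus the threads they interact with via locks) might look roughly like the sketch below. It rests on assumptions: `LockScopedDumpSketch` and its method names are hypothetical, the caller is assumed to already know the IDs of the checkpoint's async threads, and "interacts via locks" is approximated by the blocked-on relation that `ThreadInfo.getLockOwnerId()` exposes at the moment of the dump. Threads that share a lock but are not contending for it at that instant would be missed, which matches the caveat in the description that scoping may not be possible in the general case.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Flink): limits a dump to lock-related threads. */
final class LockScopedDumpSketch {

    private LockScopedDumpSketch() {}

    /**
     * Expands the seed thread IDs along "waits on a lock owned by" edges, in both
     * directions, until no more threads are added. waitsOn maps a blocked thread's
     * ID to the ID of the thread that owns the lock it is waiting for.
     */
    static Set<Long> relatedIds(Map<Long, Long> waitsOn, Set<Long> seeds) {
        Set<Long> selected = new HashSet<>(seeds);
        boolean grew = true;
        while (grew) {
            grew = false;
            for (Map.Entry<Long, Long> edge : waitsOn.entrySet()) {
                Long waiter = edge.getKey();
                Long owner = edge.getValue();
                // A selected thread is blocked: include the lock owner.
                if (selected.contains(waiter) && selected.add(owner)) {
                    grew = true;
                }
                // A thread is blocked on a selected thread: include the waiter.
                if (selected.contains(owner) && selected.add(waiter)) {
                    grew = true;
                }
            }
        }
        return selected;
    }

    /** Takes a full dump, then keeps only the seed threads and their lock neighbours. */
    static List<ThreadInfo> dumpRelatedTo(Set<Long> seedThreadIds) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] all = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
        Map<Long, Long> waitsOn = new HashMap<>();
        for (ThreadInfo info : all) {
            // getLockOwnerId() is -1 when the thread is not blocked on an owned lock.
            if (info != null && info.getLockOwnerId() != -1) {
                waitsOn.put(info.getThreadId(), info.getLockOwnerId());
            }
        }
        Set<Long> selected = relatedIds(waitsOn, seedThreadIds);
        List<ThreadInfo> result = new ArrayList<>();
        for (ThreadInfo info : all) {
            if (info != null && selected.contains(info.getThreadId())) {
                result.add(info);
            }
        }
        return result;
    }
}
```

Separating `relatedIds` from the dump itself keeps the graph expansion testable without real threads. For example, with thread 1 waiting on thread 2 and thread 3 waiting on thread 1, seeding with thread 1 selects threads 1, 2 and 3 and leaves an unrelated pair untouched.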
[jira] [Updated] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nico Kruber updated FLINK-20886:
    Affects Version/s:     (was: 1.12.0)

> Add the option to get a threaddump on checkpoint timeouts
> ---
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major