[jira] [Commented] (FLINK-8856) Move all interrupt() calls to TaskCanceler

2018-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406938#comment-16406938
 ] 

ASF GitHub Bot commented on FLINK-8856:
---

Github user StephanEwen closed the pull request at:

https://github.com/apache/flink/pull/5658


> Move all interrupt() calls to TaskCanceler
> --
>
> Key: FLINK-8856
> URL: https://issues.apache.org/jira/browse/FLINK-8856
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Affects Versions: 1.4.0, 1.4.1, 1.4.2
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.5.0, 1.4.3
>
>
> We need this to work around the following JVM bug: 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8138622
> To circumvent this problem, the {{TaskCancelerWatchDog}} must not call 
> {{interrupt()}} at all, but only join on the executing thread (with timeout) 
> and cause a hard exit once cancellation takes to long.
> A user affected by this problem reported this in FLINK-8834
> Personal note: The Thread.join(...) method unfortunately is not 100% reliable 
> as well, because it uses {{System.currentTimeMillis()}} rather than 
> {{System.nanoTime()}}. Because of that, sleeps can take overly long when the 
> clock is adjusted. I wonder why the JDK authors do not follow their own 
> recommendations and use {{System.nanoTime()}} for all relative time 
> measures...
> EDIT: I am not the only one wondering why: 
> https://stackoverflow.com/questions/42544387/why-does-thread-join-use-currenttimemillis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8856) Move all interrupt() calls to TaskCanceler

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404623#comment-16404623
 ] 

ASF GitHub Bot commented on FLINK-8856:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/5658
  
Please close this PR @StephanEwen since the commits have been merged.


> Move all interrupt() calls to TaskCanceler
> --
>
> Key: FLINK-8856
> URL: https://issues.apache.org/jira/browse/FLINK-8856
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Affects Versions: 1.4.0, 1.4.1, 1.4.2
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.5.0, 1.4.3
>
>
> We need this to work around the following JVM bug: 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8138622
> To circumvent this problem, the {{TaskCancelerWatchDog}} must not call 
> {{interrupt()}} at all, but only join on the executing thread (with timeout) 
> and cause a hard exit once cancellation takes to long.
> A user affected by this problem reported this in FLINK-8834
> Personal note: The Thread.join(...) method unfortunately is not 100% reliable 
> as well, because it uses {{System.currentTimeMillis()}} rather than 
> {{System.nanoTime()}}. Because of that, sleeps can take overly long when the 
> clock is adjusted. I wonder why the JDK authors do not follow their own 
> recommendations and use {{System.nanoTime()}} for all relative time 
> measures...
> EDIT: I am not the only one wondering why: 
> https://stackoverflow.com/questions/42544387/why-does-thread-join-use-currenttimemillis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8856) Move all interrupt() calls to TaskCanceler

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391298#comment-16391298
 ] 

ASF GitHub Bot commented on FLINK-8856:
---

Github user aljoscha commented on a diff in the pull request:

https://github.com/apache/flink/pull/5658#discussion_r173170798
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java ---
@@ -81,12 +81,15 @@ public static String stringifyException(final Throwable 
e) {
 * Currently considered fatal exceptions are Virtual Machine errors 
indicating
 * that the JVM is corrupted, like {@link InternalError}, {@link 
UnknownError},
 * and {@link java.util.zip.ZipError} (a special case of InternalError).
+* The {@link ThreadDeath} exception is also treated as a fatal error, 
because when
+* a thread is forcefully stopped, there is a high chance that parts of 
the system
+* is in an inconsistent state.
--- End diff --

nit/typo: "are in an inconsistent state"?


> Move all interrupt() calls to TaskCanceler
> --
>
> Key: FLINK-8856
> URL: https://issues.apache.org/jira/browse/FLINK-8856
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.5.0, 1.6.0
>
>
> We need this to work around the following JVM bug: 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8138622
> To circumvent this problem, the {{TaskCancelerWatchDog}} must not call 
> {{interrupt()}} at all, but only join on the executing thread (with timeout) 
> and cause a hard exit once cancellation takes to long.
> A user affected by this problem reported this in FLINK-8834
> Personal note: The Thread.join(...) method unfortunately is not 100% reliable 
> as well, because it uses {{System.currentTimeMillis()}} rather than 
> {{System.nanoTime()}}. Because of that, sleeps can take overly long when the 
> clock is adjusted. I wonder why the JDK authors do not follow their own 
> recommendations and use {{System.nanoTime()}} for all relative time 
> measures...
> EDIT: I am not the only one wondering why: 
> https://stackoverflow.com/questions/42544387/why-does-thread-join-use-currenttimemillis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8856) Move all interrupt() calls to TaskCanceler

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391297#comment-16391297
 ] 

ASF GitHub Bot commented on FLINK-8856:
---

Github user aljoscha commented on a diff in the pull request:

https://github.com/apache/flink/pull/5658#discussion_r173171090
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/DispatcherThreadFactory.java
 ---
@@ -18,6 +18,8 @@
 
 package org.apache.flink.runtime.taskmanager;
--- End diff --

nit: the commit message for this one has types: "hanle" and "esure" (feel 
free to ignore)


> Move all interrupt() calls to TaskCanceler
> --
>
> Key: FLINK-8856
> URL: https://issues.apache.org/jira/browse/FLINK-8856
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.5.0, 1.6.0
>
>
> We need this to work around the following JVM bug: 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8138622
> To circumvent this problem, the {{TaskCancelerWatchDog}} must not call 
> {{interrupt()}} at all, but only join on the executing thread (with timeout) 
> and cause a hard exit once cancellation takes to long.
> A user affected by this problem reported this in FLINK-8834
> Personal note: The Thread.join(...) method unfortunately is not 100% reliable 
> as well, because it uses {{System.currentTimeMillis()}} rather than 
> {{System.nanoTime()}}. Because of that, sleeps can take overly long when the 
> clock is adjusted. I wonder why the JDK authors do not follow their own 
> recommendations and use {{System.nanoTime()}} for all relative time 
> measures...
> EDIT: I am not the only one wondering why: 
> https://stackoverflow.com/questions/42544387/why-does-thread-join-use-currenttimemillis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8856) Move all interrupt() calls to TaskCanceler

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391075#comment-16391075
 ] 

ASF GitHub Bot commented on FLINK-8856:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/5658
  
CI failure unrelated, due to 
https://issues.apache.org/jira/browse/FLINK-8882


> Move all interrupt() calls to TaskCanceler
> --
>
> Key: FLINK-8856
> URL: https://issues.apache.org/jira/browse/FLINK-8856
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.5.0, 1.6.0
>
>
> We need this to work around the following JVM bug: 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8138622
> To circumvent this problem, the {{TaskCancelerWatchDog}} must not call 
> {{interrupt()}} at all, but only join on the executing thread (with timeout) 
> and cause a hard exit once cancellation takes to long.
> A user affected by this problem reported this in FLINK-8834
> Personal note: The Thread.join(...) method unfortunately is not 100% reliable 
> as well, because it uses {{System.currentTimeMillis()}} rather than 
> {{System.nanoTime()}}. Because of that, sleeps can take overly long when the 
> clock is adjusted. I wonder why the JDK authors do not follow their own 
> recommendations and use {{System.nanoTime()}} for all relative time 
> measures...
> EDIT: I am not the only one wondering why: 
> https://stackoverflow.com/questions/42544387/why-does-thread-join-use-currenttimemillis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8856) Move all interrupt() calls to TaskCanceler

2018-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389860#comment-16389860
 ] 

ASF GitHub Bot commented on FLINK-8856:
---

GitHub user StephanEwen opened a pull request:

https://github.com/apache/flink/pull/5658

 [FLINK-8856] [TaskManager] Move all cancellation interrupt calls to 
TaskCanceller thread

## What is the purpose of the change

This cleans up the code and guards against a JVM bug where `interrupt()` 
calls
block/deadlock if the thread is engaged in certain I/O operations.

In addition, this makes sure that the process really goes away when the 
cancellation
timeout expires, rather than relying on the TaskManager to be able to 
properly handle
the fatal error notification.

Some minor robustness enhancements related to this change are included in 
this PR.

## Verifying this change

The change is motivated by an occasional JVM bug that I could not 
purposefully trigger in tests to guard against rollback to the prior state. All 
tests were passing prior to this change and are passing after this change.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/StephanEwen/incubator-flink fix_task_interrupt

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5658.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5658


commit 36726fc5c277c649dc360975a34bf7be0afd7a0e
Author: Stephan Ewen 
Date:   2018-03-06T14:54:13Z

[hotfix] [taskmanager] Fix checkstyle in Task and TaskTest

commit 385b2032bcaa397d21d28e90efa89a44e12ebe99
Author: Stephan Ewen 
Date:   2018-03-06T14:18:33Z

[FLINK-8856] [TaskManager] Move all cancellation interrupt calls to 
TaskCanceller thread

This cleans up the code and guards against a JVM bug where 'interrupt()' 
calls
block/deadlock if the thread is engaged in certain I/O operations.

In addition, this makes sure that the process really goes away when the 
cancellation
timeout expires, rather than relying on the TaskManager to be able to 
properly handle
the fatal error notification.

commit 3b18d7d9eccc936dc53b05b787f3fd4c19171d4f
Author: Stephan Ewen 
Date:   2018-03-06T15:36:13Z

[FLINK-8883] [core] Make ThreadDeath a fatal error in ExceptionUtils

commit f3884088b210a061dba4d83323884bece1d31864
Author: Stephan Ewen 
Date:   2018-03-06T16:14:54Z

[FLINK-8885] [TaskManager] DispatcherThreadFactory registers a fatal error 
exception handler

In case dispatcher threads let an exception bubble out (does not hanle it), 
the
exception handler terminates the process, to esure we don't leave broken 
processes.

commit 5236bb73a2482ccdf016d1b9bea5cd0f17f2f620
Author: Stephan Ewen 
Date:   2018-03-06T16:18:38Z

[hotfix] [runtime] Harden FatalExitExceptionHandler

In case the logging framework throws an exception when handling the 
exception,
we still kill the process, as intended.




> Move all interrupt() calls to TaskCanceler
> --
>
> Key: FLINK-8856
> URL: https://issues.apache.org/jira/browse/FLINK-8856
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.5.0, 1.6.0
>
>
> We need this to work around the following JVM bug: 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8138622
> To circumvent this problem, the {{TaskCancelerWatchDog}} must not call 
> {{interrupt()}} at all, but only join on the executing thread (with timeout) 
> and cause a hard exit once cancellation takes to long.
> A user affected by this problem reported this in FLINK-8834
> Personal note: The Thread.join(...) method