[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-23 Thread zhijiangW
Github user zhijiangW commented on the issue: https://github.com/apache/flink/pull/3360 @StephanEwen, if the exception bubbles out and causes the TaskExecutor to exit as a result, I think the JobMaster can ultimately be assumed to be in a sane state, based on detection of TaskExecutor

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-22 Thread tillrohrmann
Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3360 Thanks for the clarification @zhijiangW. I now understand the problem: via `RpcEndpoint.runAsync` we effectively introduce another message which might get "lost" (e.g. due to OOM

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-22 Thread StephanEwen
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3360 Looking at this from another angle: If any scheduled Runnable ever lets an exception bubble out, can we still assume that the JobManager is in a sane state? Or should we actually make

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-21 Thread zhijiangW
Github user zhijiangW commented on the issue: https://github.com/apache/flink/pull/3360 Hi @tillrohrmann, thank you for the review and the helpful suggestions! Let me first try to explain the root cause of this issue: on the JobMaster side, it sends the cancel RPC message and gets

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-21 Thread tillrohrmann
Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3360 I think adding this safety net makes sense and protects against a corrupted state. However, isn't the root cause of the described problem that the JobMaster-TaskExecutor communication

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-20 Thread zhijiangW
Github user zhijiangW commented on the issue: https://github.com/apache/flink/pull/3360 @StephanEwen, I have already submitted the modifications.

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-20 Thread zhijiangW
Github user zhijiangW commented on the issue: https://github.com/apache/flink/pull/3360 @StephanEwen, thank you for such a quick review! It is a good idea to add a uniform way of doing this to the utils, so that we can use it anywhere. I will fix it according to your suggestions later

[GitHub] flink issue #3360: [FLINK-5830][Distributed Coordination] Handle OutOfMemory...

2017-02-20 Thread StephanEwen
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3360 I would suggest that we adopt the following pattern for all the places like the one in this pull request where we catch Throwables: ```java try { ... } catch (Throwable
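
The snippet in the comment above is cut off in this archive, so the following is not a reconstruction of it. It is only a minimal sketch of the general idea discussed in this thread: when catching `Throwable`, let fatal JVM errors such as `OutOfMemoryError` propagate instead of swallowing them. The class name `ThrowableHandlingSketch` and the helpers `isFatalOrOutOfMemory`, `rethrowIfFatal` and `runGuarded` are hypothetical names chosen for illustration, and the set of errors treated as fatal is an assumption.

```java
public final class ThrowableHandlingSketch {

    /**
     * Hypothetical helper: returns true for errors after which continuing
     * in the same JVM is assumed to be unsafe. The exact set is an assumption.
     */
    public static boolean isFatalOrOutOfMemory(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    /** Hypothetical helper: rethrows the throwable unchanged if it is considered fatal. */
    public static void rethrowIfFatal(Throwable t) {
        if (isFatalOrOutOfMemory(t)) {
            // All types checked above are subclasses of Error, so the cast is safe.
            throw (Error) t;
        }
    }

    /** Runs the action; recovers from ordinary failures but lets fatal errors bubble out. */
    public static String runGuarded(Runnable action) {
        try {
            action.run();
            return "ok";
        } catch (Throwable t) {
            // First give fatal errors a chance to propagate to the caller.
            rethrowIfFatal(t);
            // Anything else is handled locally.
            return "recovered: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // An ordinary exception is caught and handled.
        System.out.println(runGuarded(() -> { throw new IllegalStateException("boom"); }));

        // A simulated OutOfMemoryError is not swallowed by the catch block.
        try {
            runGuarded(() -> { throw new OutOfMemoryError("simulated"); });
        } catch (OutOfMemoryError e) {
            System.out.println("fatal error propagated");
        }
    }
}
```

Keeping the check in one shared helper means every `catch (Throwable t)` site applies the same rule, which is the "uniform way in the utils" mentioned later in the thread.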