[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack

2020-10-22 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218858#comment-17218858
 ] 

Arvid Heise commented on FLINK-19688:
-

Merged into master as 840e8af879e69c1bf9ad121b670a2703eb88b858.
Closing issue as resolved.

> Flink batch job fails because of InterruptedExceptions from network stack
> -
>
> Key: FLINK-19688
> URL: https://issues.apache.org/jira/browse/FLINK-19688
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network, Runtime / Task
> Affects Versions: 1.12.0
> Reporter: Robert Metzger
> Assignee: Roman Khachatryan
> Priority: Blocker
> Fix For: 1.12.0
>
> Attachments: logs.tgz
>
>
> I have a benchmarking test job that throws RuntimeExceptions at any operator 
> at a configured, random interval. With short intervals, such as a mean 
> failure interval of 60 s, the job gets into a state where it frequently fails 
> with InterruptedExceptions.
> The same job does not have this problem on Flink 1.11.2 (at least not after 
> running for 15 hours); on 1.12-SNAPSHOT, it happens within a few minutes.
> This is the job: 
> https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN  org.apache.flink.runtime.taskmanager.Task [] - CHAIN GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at appendMapper(KillerClientMapper.java:38)) (8/8)#1 (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at appendMapper(KillerClientMapper.java:38))' , caused an error: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>   at org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   ... 4 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_222]
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_222]
>   at org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 

[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack

2020-10-22 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218855#comment-17218855
 ] 

Roman Khachatryan commented on FLINK-19688:
---

https://github.com/apache/flink/pull/13723
