[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack
[ https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218858#comment-17218858 ]

Arvid Heise commented on FLINK-19688:
-------------------------------------

Merged into master as 840e8af879e69c1bf9ad121b670a2703eb88b858. Closing issue as resolved.

> Flink batch job fails because of InterruptedExceptions from network stack
> -------------------------------------------------------------------------
>
>                 Key: FLINK-19688
>                 URL: https://issues.apache.org/jira/browse/FLINK-19688
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network, Runtime / Task
>    Affects Versions: 1.12.0
>            Reporter: Robert Metzger
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>             Fix For: 1.12.0
>
>         Attachments: logs.tgz
>
>
> I have a benchmarking test job, that throws RuntimeExceptions at any operator
> at a configured, random interval. When using low intervals, such as mean
> failure rate = 60 s, the job will get into a state where it frequently fails
> with InterruptedExceptions.
> The same job does not have this problem on Flink 1.11.2 (at least not after
> running the job for 15 hours, on 1.12-SN, it happens within a few minutes)
> This is the job:
> https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN org.apache.flink.runtime.taskmanager.Task [] - CHAIN GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at appendMapper(KillerClientMapper.java:38)) (8/8)#1 (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at appendMapper(KillerClientMapper.java:38))' , caused an error: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>     at org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     ... 4 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_222]
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_222]
>     at org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at
[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack
[ https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218855#comment-17218855 ]

Roman Khachatryan commented on FLINK-19688:
-------------------------------------------

https://github.com/apache/flink/pull/13723