Hi,

When I run a Python pipeline with multiple concurrent TaskManagers on Flink, the job hardly ever (or never) finishes properly. At the end, Beam (or Flink?) always throws a seemingly random gRPC IllegalStateException after my last GlobalCombine, so Beam goes into some weird error handling mode and eventually fails to job, even though it should have finished cleanly.

This is only reproducible with parallelism set to at least 5 or 8. With 1-4, I cannot reliably (or at all) reproduce it. It looks like a similar issue has already been reported on Jira (https://issues.apache.org/jira/browse/BEAM-8980), but it got marked as stale. Anyone else seeing this? Is there anything I can do? I don't want my job to restart after it's finished and I want a clean exit status, otherwise I don't really know if everything succeeded properly and I don't want to comb through hundreds of log files to find out.

I added a comment with the stacktraces that I get below the above-mentioned issue: https://issues.apache.org/jira/browse/BEAM-8980?focusedCommentId=17483174&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17483174

Janek

Reply via email to