Hi,

There are several reports of this or similar issues on GitHub and Jira, but none of them has a proper solution, so maybe you can help me.

I am running Beam on Flink (using the portable runner via Beam's Flink job server). Whenever something takes a bit longer than expected or the shuffle size gets a bit larger, my workers fail at random with the following error:

E0922 08:50:52.814447061     222 chttp2_transport.cc:1167] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"

I have already tried adding

("grpc.http2.max_pings_without_data", 0),
("grpc.http2.max_ping_strikes", 0)

to DEFAULT_OPTIONS in sdks/python/apache_beam/runners/worker/channel_factory.py, but without success. Are there any other places where gRPC connections are established that need these extra options? Are there any other options that I overlooked?
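
For reference, here is a minimal sketch of how such options could be merged into a channel's option list and handed to gRPC. DEFAULT_OPTIONS below is a hypothetical stand-in rather than the actual list in channel_factory.py, and merge_options and open_channel are illustrative helper names, not Beam APIs. Only grpc.insecure_channel and the option keys are real gRPC names.

```python
# Hypothetical stand-in for the defaults in channel_factory.py; the real
# list in Beam may contain different keys and values.
DEFAULT_OPTIONS = [
    ("grpc.keepalive_time_ms", 20000),
]

# The two options quoted above.
EXTRA_OPTIONS = [
    ("grpc.http2.max_pings_without_data", 0),
    ("grpc.http2.max_ping_strikes", 0),
]


def merge_options(defaults, overrides):
    """Return defaults with overrides applied; an override replaces a
    default that has the same key, and insertion order is preserved."""
    merged = dict(defaults)
    merged.update(overrides)
    return list(merged.items())


def open_channel(target, options):
    """Open an insecure channel with the given options (requires grpcio)."""
    import grpc  # imported lazily so the merge logic runs without grpcio
    return grpc.insecure_channel(target, options=options)


options = merge_options(DEFAULT_OPTIONS, EXTRA_OPTIONS)
```

Note that channel options set this way only affect the Python side of a connection. gRPC's keepalive documentation describes grpc.http2.max_ping_strikes as a server-side setting, so it may have no effect when the peer that sends the GOAWAY is a server in another process.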

The most relevant (unsolved) issue report is probably this one: https://github.com/apache/beam/issues/21598

This issue is serious, since it effectively prevents me from running jobs with more than a handful of workers or with large data.

Many thanks
Janek
