Probably you just need to break fusion by introducing a Reshuffle [1] transform after the ReadHTTPBeam step. The way the pipeline is structured currently, Dataflow will fuse everything into a single step and will run in a single worker.

That's most likely the issue. I had the same thing on Flink and assumed that Dataflow would support auto workload redistribution. So this seems to be a general restriction and not specific to Flink after all? This is definitely something that needs to be fixed urgently.

I solved it on our end by writing custom input transforms for our processing library. It's in Python, not Go, but perhaps it's still useful to you. Here's the one for reading WARC files: https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/develop/resiliparse/resiliparse/beam/warcio.py

General documentation: https://resiliparse.chatnoir.eu/en/stable/api/beam/warcio.html

Janek

Reply via email to