You probably just need to break fusion by introducing a Reshuffle [1]
transform after the ReadHTTPBeam step. The way the pipeline is
currently structured, Dataflow will fuse everything into a single step
and run it all on a single worker.
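As a rough Python sketch of the idea (your pipeline is in Go, and the
names here are only placeholders for your actual transforms), you would
insert the Reshuffle between the read and the downstream processing:

import apache_beam as beam


def fetch_records(url):
    # Placeholder for the actual HTTP read; yields one record per URL.
    yield f"fetched: {url}"


def process_record(record):
    # Placeholder for the expensive downstream processing.
    return record.upper()


with beam.Pipeline() as p:
    _ = (
        p
        | "ListURLs" >> beam.Create(["https://example.com/a",
                                     "https://example.com/b"])
        | "ReadHTTPBeam" >> beam.FlatMap(fetch_records)
        # Reshuffle materialises the read output and breaks fusion, so
        # the processing below can be rebalanced across workers.
        | "BreakFusion" >> beam.Reshuffle()
        | "Process" >> beam.Map(process_record)
    )

Without the Reshuffle, Dataflow is free to fuse the read and the
processing into one stage, which is what pins everything to one worker.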
That's most likely the issue. I had the same thing on Flink and assumed
that Dataflow would support automatic workload redistribution. So this
seems to be a general restriction and not specific to Flink after all?
This is definitely something that needs to be fixed urgently.
I solved it on our end by writing custom input transforms for our
processing library. It's in Python, not Go, but perhaps it's still
useful to you. Here's the one for reading WARC files:
https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/develop/resiliparse/resiliparse/beam/warcio.py
General documentation:
https://resiliparse.chatnoir.eu/en/stable/api/beam/warcio.html
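If it helps, the general pattern such an input transform follows is
roughly the sketch below. This is simplified and illustrative, not the
actual resiliparse code; the class and helper names are made up: expand
the file glob, reshuffle the file list so the files get distributed
across workers, and only then read the records from each file.

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def _read_warc_records(file_metadata):
    # Placeholder: a real implementation would stream individual WARC
    # records from the file with a proper parser instead of read().
    stream = FileSystems.open(file_metadata.path)
    try:
        yield stream.read()
    finally:
        stream.close()


class ReadWarcFiles(beam.PTransform):
    # Illustrative fusion-friendly file input: match the glob,
    # reshuffle the file list across workers, then read each file.

    def __init__(self, file_pattern):
        super().__init__()
        self._file_pattern = file_pattern

    def expand(self, pbegin):
        file_list = FileSystems.match([self._file_pattern])[0].metadata_list
        return (
            pbegin
            | "ListFiles" >> beam.Create(file_list)
            | "DistributeFiles" >> beam.Reshuffle()
            | "ReadRecords" >> beam.FlatMap(_read_warc_records)
        )

The Reshuffle after the file match is the important bit: it prevents
the per-file reading from being fused with the step that produced the
file list, so each file can end up on a different worker.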
Janek