Hi Beamers,

We are facing OutOfMemory errors with a streaming pipeline on Dataflow. We
cannot get rid of them, not even with bigger worker instances. Any advice
would be appreciated.

The scenario is the following.
- Files are being written to a bucket in GCS. A notification is set up on
the bucket, so for each file there is a Pub/Sub message
- Notifications are consumed by our pipeline
- Files from notifications are read by the pipeline
- Each file contains several events (so there is quite a big fan-out)
- Events are written to BigQuery (streaming)
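To make the shape of the job concrete, here is a rough sketch of the pipeline in the Python SDK. The project, subscription and table names are placeholders, not our real code, and the notification parsing assumes the standard JSON payload of GCS Pub/Sub notifications (bucket/name fields):

```python
import json


def gcs_path_from_notification(message_bytes):
    """Turn a GCS Pub/Sub notification payload (JSON) into a gs:// path."""
    payload = json.loads(message_bytes)
    return "gs://{}/{}".format(payload["bucket"], payload["name"])


def run():
    # Requires apache-beam[gcp]; placeholder names throughout.
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub
    from apache_beam.io.textio import ReadAllFromText
    from apache_beam.io.gcp.bigquery import WriteToBigQuery

    with beam.Pipeline() as p:
        (p
         | "ReadNotifications" >> ReadFromPubSub(
               subscription="projects/my-project/subscriptions/gcs-notifications")
         | "ToPaths" >> beam.Map(gcs_path_from_notification)
         | "ReadFiles" >> ReadAllFromText()       # big fan-out: one file -> many events
         | "ParseEvents" >> beam.Map(json.loads)
         | "WriteToBQ" >> WriteToBigQuery(
               "my-project:my_dataset.events",
               method=WriteToBigQuery.Method.STREAMING_INSERTS))
```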

Everything works fine if there are only a few notifications, but if files
arrive at a high rate, or if there is a large backlog in the notification
subscription, elements pile up in the BigQuery write step and eventually an
OOM is thrown.

Using larger workers does not help: if the backlog is large enough, the
larger instance(s) throw OOM as well, just later...

As far as we can see, there is a checkpointing/Reshuffle step inside the
BigQuery write, and that's where the elements get stuck. It looks like the
Pub/Sub source consumes too many elements, and because of the fan-out this
causes an OOM when grouping in the Reshuffle.
Is there any way to apply backpressure to the Pub/Sub read? Is it possible
to use a smaller window in the Reshuffle? How does Reshuffle actually work?
Any advice?
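For context, our current mental model of Reshuffle is the following: assign each element a random key, group by that key (which checkpoints/buffers everything and breaks fusion), then drop the keys again. Here is a stdlib-only sketch of that mental model; it is not Beam's actual implementation, and we may well be wrong:

```python
import random
from collections import defaultdict


def reshuffle(elements, num_keys=4):
    """Sketch of what we believe Reshuffle does (not Beam's code):
    add a random key, GroupByKey, then strip the key again."""
    groups = defaultdict(list)
    for element in elements:
        # Step 1: pair each element with a random key.
        groups[random.randrange(num_keys)].append(element)
    out = []
    # Step 2+3: the GroupByKey buffers all elements per key
    # (this is where we suspect the memory goes), then keys are dropped.
    for _key, values in groups.items():
        out.extend(values)
    return out
```

If that model is right, it would explain why a large backlog blows up: everything read from Pub/Sub up to the grouping has to be buffered at once.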

Thanks in advance,
Frantisek
