Hi Frantisek,

Some advice from making a similar pipeline and struggling with throughput and 
latency:

  1.  Break up your pipeline into multiple pipelines. Dataflow only auto-scales 
based on input throughput. If you're microbatching events in files, the job 
will only autoscale to meet the volume of files, not the volume of events added 
to the pipeline from the files. A better flow is:
        i.  Pipeline 1: Receive GCS notifications, read the files, and output 
the file contents as Pub/Sub messages, either per event or in microbatches 
per file
        ii. Pipeline 2: Receive events from Pub/Sub, apply your transforms, 
then write to BQ
  2.  Use the Streaming Engine service (if you can run in a region that 
supports it): 
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#streaming-engine
  3.  BQ streaming inserts can be slower than a load job at very high volume 
(millions of events a minute). If your event volume is that high, you might 
need to consider further microbatching loads into BQ from GCS.
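A minimal sketch of the microbatching in step 1, assuming a plain-Python helper 
(the function name and batch size are mine, not from the Beam or Dataflow 
docs): pipeline 1 reads each file and re-publishes its events to Pub/Sub in 
bounded chunks, so pipeline 2's autoscaling tracks event volume rather than 
file volume.

```python
from typing import Iterable, Iterator, List

def microbatch(events: Iterable[str], batch_size: int = 500) -> Iterator[List[str]]:
    """Yield the file's events in lists of at most batch_size elements."""
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each batch would then be published as one Pub/Sub message, e.g. (illustrative):
#   publisher.publish(topic, data="\n".join(batch).encode("utf-8"))
```

The batch size is a tuning knob: it bounds the fanout each Pub/Sub message 
carries into pipeline 2, at the cost of more messages per file.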

Hope this helps,
Erik

From: Frantisek Csajka <[email protected]>
Sent: Friday, November 22, 2019 5:35 AM
To: [email protected]
Subject: Fanout and OOMs on Dataflow while streaming to BQ

Hi Beamers,

We are facing OutOfMemory errors with a streaming pipeline on Dataflow. We are 
unable to get rid of them, not even with bigger worker instances. Any advice 
will be appreciated.

The scenario is the following.
- Files are being written to a bucket in GCS. A notification is set up on the 
bucket, so for each file there is a Pub/Sub message
- Notifications are consumed by our pipeline
- Files from notifications are read by the pipeline
- Each file contains several events (there is a quite big fanout)
- Events are written to BigQuery (streaming)

Everything works fine if there are only a few notifications, but if files 
arrive at a high rate, or if there is a large backlog in the notification 
subscription, events get stuck in the BQ write and an OOM is eventually thrown.

Using larger workers does not help: if the backlog is large, the larger 
instance(s) throw OOM as well, just later...

As far as we can see, there is a checkpointing/reshuffle step in the BQ write, 
and that's where the elements get stuck. It looks like the Pub/Sub read is 
consuming too many elements, and due to the fanout this causes OOM when 
grouping in the reshuffle.
Is there any way to backpressure the Pub/Sub read? Is it possible to use a 
smaller window in Reshuffle? How does Reshuffle actually work?
Any advice?

Thanks in advance,
Frantisek
