I'm attempting to deploy a fairly simple job on the Dataflow runner that reads from PubSub and writes to BigQuery using file loads, but so far I haven't been able to tune it to keep up with the incoming data rate.
I have configured BigQueryIO.write to trigger loads every 5 minutes, and I've let the job autoscale up to a maximum of 60 workers (which it has done). I'm using dynamic destinations to write to 2 field-partitioned tables. Incoming data is ~10k events/second per table, so each table should be ingesting on the order of 3 million records of ~20 kB apiece per 5-minute load.

We don't get many knobs to turn in BigQueryIO. I have tested numShards values between 10 and 1000, but haven't seen any obvious difference in performance.

Possibly relevant: I see a high rate of warnings on the shuffler. They are mostly LevelDB warnings about "Too many L0 files", with occasional memory-related warnings as well.

Would using larger workers potentially help? Does anybody have experience tuning BigQueryIO file loads? It's quite a complicated transform under the hood, and it looks like there are several steps of grouping and shuffling data that could be limiting throughput.
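For reference, here is roughly what my write configuration looks like. This is a trimmed-down sketch, not the real job: the table names, the "event_type" routing field, and the 500-shard value are placeholders standing in for my actual dynamic-destination logic and the numShards values I've been sweeping.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Duration;

public class WriteConfigSketch {

  // Placeholder router: picks one of the two pre-existing field-partitioned
  // tables based on an "event_type" field in the row.
  static class TwoTableDestinations extends DynamicDestinations<TableRow, String> {
    @Override
    public String getDestination(ValueInSingleWindow<TableRow> element) {
      return (String) element.getValue().get("event_type");
    }

    @Override
    public TableDestination getTable(String eventType) {
      // Placeholder table spec; the tables already exist and are partitioned on a field.
      return new TableDestination("my-project:my_dataset.events_" + eventType, null);
    }

    @Override
    public TableSchema getSchema(String eventType) {
      return null; // not needed with CREATE_NEVER
    }
  }

  static void applyWrite(PCollection<TableRow> rows) {
    // Launched on the Dataflow runner with autoscaling enabled and --maxNumWorkers=60.
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.<TableRow>write()
            .withMethod(Method.FILE_LOADS)                        // batch loads, not streaming inserts
            .withTriggeringFrequency(Duration.standardMinutes(5)) // load jobs every 5 minutes
            .withNumFileShards(500)                               // I've tried values from 10 to 1000
            .to(new TwoTableDestinations())
            .withFormatFunction(row -> row)                       // elements are already TableRows
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}
```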