Hi all,

We are using BigQueryIO.write() to stream data into BigQuery, and we are
seeing very poor throughput in terms of writes per second per worker.

We are currently using *32* x *n1-standard-4* workers to stream ~15,000
writes/sec to BigQuery, i.e. only ~470 writes/sec per worker. Each worker
sits at ~90% CPU utilisation. Strangely, both the number of workers and the
worker CPU utilisation (~90%) stay constant even when the input rate drops
below 10,000 writes/sec. The job always keeps up with the stream (no backlog).

I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved
with a single node when streaming data into a *single* BQ table... So my
theory is that writing to multiple tables is somehow causing the
performance issue. Our writes are spread (unevenly) across 200+ tables. The
job itself does very little processing, and according to the Dataflow
metrics pretty much all of the wall time is spent in the *StreamingWrite*
step of BigQueryIO. We are on Beam 2.2.0.

Our code looks like this:

stream.apply(BigQueryIO.<MyElement>write()
    .to(new ToDestination())
    .withFormatFunction(new FormatForBigQuery())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

where ToDestination is a:

SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>

which returns a:

new TableDestination(tableName, "")

where tableName looks like "myproject:dataset.tablename$20180228"
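
For completeness, ToDestination is roughly the following (a simplified
sketch; the MyElement accessors getTableName() and getEventTime() are
illustrative placeholders, not our exact code):

import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

class ToDestination
    implements SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination> {

  // Daily partition decorator, e.g. "$20180228".
  private static final DateTimeFormatter PARTITION_FORMAT =
      DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC();

  @Override
  public TableDestination apply(ValueInSingleWindow<MyElement> input) {
    MyElement element = input.getValue();
    // Route each element to its own table, targeting the partition for its
    // event time. getTableName() and getEventTime() (a Joda Instant) are
    // illustrative accessors.
    String tableSpec = "myproject:dataset." + element.getTableName()
        + "$" + PARTITION_FORMAT.print(element.getEventTime());
    return new TableDestination(tableSpec, "");
  }
}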

Has anyone else seen this kind of poor performance when streaming
writes to multiple BQ tables? Is there anything here that sounds
wrong, or any optimisations we can make?

Thanks for any advice!

Josh
