Could be a DataflowRunner-specific issue. Would you mind reporting this, with the corresponding Dataflow job IDs, to either the Dataflow Stack Overflow channel [1] or [email protected]?
I suspect Dataflow splits writing to multiple tables across multiple workers, which may be keeping all workers busy, but we have to look at the job to confirm.

Thanks,
Cham

[1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow

On Tue, Feb 27, 2018 at 11:56 PM Josh <[email protected]> wrote:

> Hi all,
>
> We are using BigQueryIO.write() to stream data into BigQuery, and are
> seeing very poor performance in terms of the number of writes per second per
> worker.
>
> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely, the
> number of workers and worker CPU utilisation remain constant at ~90% even
> when the rate of input fluctuates down to below 10,000 writes/sec. The job
> always keeps up with the stream (no backlog).
>
> I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved
> with a single node when streaming data into a *single* BQ table... So my
> theory is that writing to multiple tables is somehow causing the
> performance issue. Our writes are spread (unevenly) across 200+ tables. The
> job itself does very little processing, and looking at the Dataflow metrics,
> pretty much all of the wall time is spent in the *StreamingWrite* step of
> BigQueryIO. The Beam version is 2.2.0.
>
> Our code looks like this:
>
> stream.apply(BigQueryIO.<MyElement>write()
>     .to(new ToDestination())
>     .withFormatFunction(new FormatForBigQuery())
>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>
> where ToDestination is a:
>
> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>
> which returns a:
>
> new TableDestination(tableName, "")
>
> where tableName looks like "myproject:dataset.tablename$20180228"
>
> Has anyone else seen this kind of poor performance when streaming writes
> to multiple BQ tables? Is there anything here that sounds wrong, or any
> optimisations we can make?
>
> Thanks for any advice!
>
> Josh
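For reference, here is a minimal sketch of what the ToDestination and FormatForBigQuery functions described in the quoted message could look like. MyElement, its fields, the row schema, and the project/dataset/table naming below are assumptions for illustration only, not Josh's actual code; only the function signatures and the TableDestination with a partition decorator come from the thread.

import java.io.Serializable;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Hypothetical element type; the real MyElement is not shown in the thread.
class MyElement implements Serializable {
  String tableName;  // logical table this element should be routed to
  String day;        // partition day, e.g. "20180228"
  String payload;

  MyElement(String tableName, String day, String payload) {
    this.tableName = tableName;
    this.day = day;
    this.payload = payload;
  }
}

// Routes each element to a per-table destination with a daily partition
// decorator, e.g. "myproject:dataset.tablename$20180228".
class ToDestination
    implements SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination> {
  @Override
  public TableDestination apply(ValueInSingleWindow<MyElement> input) {
    MyElement e = input.getValue();
    String tableSpec = "myproject:dataset." + e.tableName + "$" + e.day;
    return new TableDestination(tableSpec, "");  // empty table description
  }
}

// Converts an element into the TableRow that BigQueryIO streams to BigQuery.
class FormatForBigQuery implements SerializableFunction<MyElement, TableRow> {
  @Override
  public TableRow apply(MyElement e) {
    return new TableRow().set("payload", e.payload);  // assumed single-column schema
  }
}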
