Could you please keep posting any findings you make here?
I'm very interested in this issue as well.
On Thu, Mar 1, 2018 at 9:45 AM Josh <jof...@gmail.com> wrote:
> Hi Cham,
> Thanks, I have emailed the dataflow-feedback email address with the
> corresponding Dataflow job IDs.
> Best regards,
> On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath <chamik...@google.com>
>> Could be a DataflowRunner-specific issue. Would you mind reporting this
>> with the corresponding Dataflow job IDs to either the Dataflow Stack
>> Overflow channel or dataflow-feedb...@google.com?
>> I suspect Dataflow splits writing to multiple tables across multiple
>> workers, which may be keeping all workers busy, but we would have to look
>> at the job to confirm.
>> https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>> On Tue, Feb 27, 2018 at 11:56 PM Josh <jof...@gmail.com> wrote:
>>> Hi all,
>>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>>> seeing very poor performance in terms of the number of writes per second
>>> per worker.
>>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely,
>>> the number of workers remains constant and worker CPU utilisation stays
>>> at ~90% even when the input rate fluctuates down to below 10,000
>>> writes/sec. The job always keeps up with the stream (no backlog).
>>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being
>>> achieved with a single node, when streaming data into a *single* BQ
>>> table... So my theory is that writing to multiple tables is somehow causing
>>> the performance issue. Our writes are spread (unevenly) across 200+ tables.
>>> The job itself does very little processing, and looking at the Dataflow
>>> metrics pretty much all of the wall time is spent in the
>>> *StreamingWrite* step of BigQueryIO. The Beam version is 2.2.0.
>>> Our code looks like this:
>>> .to(new ToDestination())
>>> .withFormatFunction(new FormatForBigQuery())
>>> where ToDestination is a:
>>> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>>> which returns a:
>>> new TableDestination(tableName, "")
>>> where tableName looks like "myproject:dataset.tablename$20180228"
>>> Has anyone else seen this kind of poor performance when streaming writes
>>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>>> optimisations we can make?
>>> Thanks for any advice!
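For reference, a table spec string with a daily partition decorator like the
"myproject:dataset.tablename$20180228" example above can be built with a plain
date formatter. This is just a minimal sketch of the string format (the
`dailyPartitionSpec` helper and the project/dataset/table names are
placeholders, not part of the Beam API):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TableSpec {
    // Builds "project:dataset.table$yyyyMMdd". The "$yyyyMMdd" suffix is
    // BigQuery's partition decorator, which targets a specific daily
    // partition when streaming inserts into a date-partitioned table.
    static String dailyPartitionSpec(String project, String dataset,
                                     String table, LocalDate day) {
        // BASIC_ISO_DATE formats a LocalDate as "yyyyMMdd".
        return project + ":" + dataset + "." + table
                + "$" + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(dailyPartitionSpec(
                "myproject", "dataset", "tablename",
                LocalDate.of(2018, 2, 28)));
        // myproject:dataset.tablename$20180228
    }
}
```

A TableDestination built from such a spec (with an empty description, as in
the snippet above) is what the SerializableFunction passed to `.to(...)`
returns for each element.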