Hi Cham,

Thanks, I have emailed the dataflow-feedback email address with the details.

Best regards,
Josh

On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath <[email protected]>
wrote:

> This could be a DataflowRunner-specific issue. Would you mind reporting it,
> along with the corresponding Dataflow job IDs, to either the Dataflow Stack
> Overflow channel [1] or [email protected]?
>
> I suspect Dataflow splits writing to multiple tables across multiple
> workers, which may be keeping all of them busy, but we would have to look at
> the job to confirm.
>
> Thanks,
> Cham
>
> [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>
> On Tue, Feb 27, 2018 at 11:56 PM Josh <[email protected]> wrote:
>
>> Hi all,
>>
>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>> seeing very poor performance in terms of number of writes per second per
>> worker.
>>
>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely,
>> the number of workers and the ~90% CPU utilisation remain constant even
>> when the input rate fluctuates down to below 10,000 writes/sec. The job
>> always keeps up with the stream (no backlog).
>>
>> I've seen BigQueryIO benchmarks that show ~20k writes/sec being achieved
>> with a single node when streaming data into a *single* BQ table... So
>> my theory is that writing to multiple tables is somehow causing the
>> performance issue. Our writes are spread (unevenly) across 200+ tables. The
>> job itself does very little processing and, looking at the Dataflow
>> metrics, pretty much all of the wall time is spent in the *StreamingWrite*
>> step of BigQueryIO. The Beam version is 2.2.0.
>>
>> Our code looks like this:
>>
>> stream.apply(BigQueryIO.<MyElement>write()
>>     .to(new ToDestination())
>>     .withFormatFunction(new FormatForBigQuery())
>>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>>
>> where ToDestination is a:
>>
>> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>>
>> which returns a:
>>
>> new TableDestination(tableName, "")
>>
>> where tableName looks like "myproject:dataset.tablename$20180228"
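>>
>> For reference, ToDestination is essentially the following (a simplified,
>> self-contained sketch; the getTableName() and getTimestamp() accessors on
>> MyElement are illustrative stand-ins, not our exact code):
>>
>> import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
>> import org.apache.beam.sdk.transforms.SerializableFunction;
>> import org.apache.beam.sdk.values.ValueInSingleWindow;
>> import org.joda.time.format.DateTimeFormat;
>> import org.joda.time.format.DateTimeFormatter;
>>
>> public class ToDestination
>>     implements SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination> {
>>
>>   // Formats the event timestamp as a yyyyMMdd partition decorator suffix.
>>   private static final DateTimeFormatter DAY_FORMAT =
>>       DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC();
>>
>>   @Override
>>   public TableDestination apply(ValueInSingleWindow<MyElement> input) {
>>     // Route each element to the daily partition of its own table, e.g.
>>     // "myproject:dataset.tablename$20180228". Both accessors below are
>>     // illustrative stand-ins for however the element carries this info.
>>     String table = input.getValue().getTableName();
>>     String day = DAY_FORMAT.print(input.getValue().getTimestamp());
>>     return new TableDestination(table + "$" + day, "");
>>   }
>> }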
>>
>> Has anyone else seen this kind of poor performance when streaming writes
>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>> optimisations we can make?
>>
>> Thanks for any advice!
>>
>> Josh
>>
>
