Could you please keep posting any findings you make here?
I'm very interested in this issue as well.
On Thu, Mar 1, 2018 at 9:45 AM Josh <jof...@gmail.com> wrote:
> Hi Cham,
> Thanks, I have emailed the dataflow-feedback email address with the
> corresponding Dataflow job IDs.
> Best regards,
> On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath <chamik...@google.com>
>> Could be a DataflowRunner-specific issue. Would you mind reporting this
>> with the corresponding Dataflow job IDs to either the Dataflow Stack
>> Overflow channel or dataflow-feedb...@google.com?
>> I suspect Dataflow splits writing to multiple tables across multiple
>> workers, which may be keeping all workers busy, but we would have to look
>> at the job to confirm.
>> https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>> On Tue, Feb 27, 2018 at 11:56 PM Josh <jof...@gmail.com> wrote:
>>> Hi all,
>>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>>> seeing very poor performance in terms of the number of writes per second
>>> per worker.
>>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely,
>>> the number of workers remains constant and worker CPU utilisation stays
>>> at ~90% even when the input rate fluctuates down to below 10,000
>>> writes/sec. The job always keeps up with the stream (no backlog).
>>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being
>>> achieved with a single node, when streaming data into a *single* BQ
>>> table... So my theory is that writing to multiple tables is somehow causing
>>> the performance issue. Our writes are spread (unevenly) across 200+ tables.
>>> The job itself does very little processing, and looking at the Dataflow
>>> metrics pretty much all of the wall time is spent in the
>>> *StreamingWrite* step of BigQueryIO. The Beam version is 2.2.0.
>>> Our code looks like this:
>>> .to(new ToDestination())
>>> .withFormatFunction(new FormatForBigQuery())
>>> where ToDestination is a:
>>> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>>> which returns a:
>>> new TableDestination(tableName, "")
>>> where tableName looks like "myproject:dataset.tablename$20180228"
>>> Has anyone else seen this kind of poor performance when streaming writes
>>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>>> optimisations we can make?
>>> Thanks for any advice!
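For reference, a table spec string with a daily partition decorator like the
"myproject:dataset.tablename$20180228" example above can be built with a plain
date formatter. This is just a minimal sketch of the string format (the
`dailyPartitionSpec` helper and the project/dataset/table names are
placeholders, not part of the Beam API):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TableSpec {
    // Builds "project:dataset.table$yyyyMMdd". The "$yyyyMMdd" suffix is
    // BigQuery's partition decorator, which targets a specific daily
    // partition when streaming inserts into a date-partitioned table.
    static String dailyPartitionSpec(String project, String dataset,
                                     String table, LocalDate day) {
        // BASIC_ISO_DATE formats a LocalDate as "yyyyMMdd".
        return project + ":" + dataset + "." + table
                + "$" + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(dailyPartitionSpec(
                "myproject", "dataset", "tablename",
                LocalDate.of(2018, 2, 28)));
        // myproject:dataset.tablename$20180228
    }
}
```

A TableDestination built from such a spec (with an empty description, as in
the snippet above) is what the SerializableFunction passed to `.to(...)`
returns for each element.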