Hi Cham,

Thanks, I have emailed the dataflow-feedback email address with the details.
Best regards,
Josh

On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath <[email protected]> wrote:

> Could be a DataflowRunner-specific issue. Would you mind reporting this
> with the corresponding Dataflow job IDs to either the Dataflow Stack
> Overflow channel [1] or [email protected]?
>
> I suspect Dataflow splits writing to multiple tables across multiple
> workers, which may keep all workers busy, but we have to look at the job
> to confirm.
>
> Thanks,
> Cham
>
> [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>
> On Tue, Feb 27, 2018 at 11:56 PM Josh <[email protected]> wrote:
>
>> Hi all,
>>
>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>> seeing very poor performance in terms of the number of writes per second
>> per worker.
>>
>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely,
>> the number of workers and the worker CPU utilisation remain constant at
>> ~90% even when the input rate fluctuates down to below 10,000 writes/sec.
>> The job always keeps up with the stream (no backlog).
>>
>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being
>> achieved with a single node when streaming data into a *single* BQ
>> table... So my theory is that writing to multiple tables is somehow
>> causing the performance issue. Our writes are spread (unevenly) across
>> 200+ tables. The job itself does very little processing, and looking at
>> the Dataflow metrics, pretty much all of the wall time is spent in the
>> *StreamingWrite* step of BigQueryIO. The Beam version is 2.2.0.
>>
>> Our code looks like this:
>>
>> stream.apply(BigQueryIO.<MyElement>write()
>>     .to(new ToDestination())
>>     .withFormatFunction(new FormatForBigQuery())
>>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>>
>> where ToDestination is a:
>>
>> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>>
>> which returns a:
>>
>> new TableDestination(tableName, "")
>>
>> where tableName looks like "myproject:dataset.tablename$20180228"
>>
>> Has anyone else seen this kind of poor performance when streaming writes
>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>> optimisations we can make?
>>
>> Thanks for any advice!
>>
>> Josh
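[Editor's note: for reference, a minimal sketch of what the ToDestination
function described in the thread might look like. MyElement and its
getTableName()/getPartitionDate() accessors are assumptions for
illustration; only the SerializableFunction signature and the
TableDestination construction are taken from the thread.]

import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// A minimal sketch, assuming MyElement exposes the target table name and a
// partition date; the $YYYYMMDD suffix routes each row to a specific
// partition of a date-partitioned table, as in the thread's example spec
// "myproject:dataset.tablename$20180228".
class ToDestination
    implements SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination> {

  @Override
  public TableDestination apply(ValueInSingleWindow<MyElement> input) {
    MyElement element = input.getValue();  // hypothetical element type
    String tableName =
        element.getTableName() + "$" + element.getPartitionDate();
    return new TableDestination(tableName, "");  // empty table description
  }
}

With CREATE_NEVER, no schema needs to be supplied per destination, so this
simple function form of .to() suffices; if per-table schemas were needed,
the DynamicDestinations variant of .to() would be the richer alternative.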
