Could you please keep posting any findings you make here?
I'm very interested in this issue as well.
On Thu, Mar 1, 2018 at 9:45 AM Josh <jof...@gmail.com> wrote:
> Hi Cham,
> Thanks, I have emailed the dataflow-feedback email address with the
> corresponding Dataflow job IDs.
> Best regards,
> On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath <chamik...@google.com>
>> Could be a DataflowRunner-specific issue. Would you mind reporting this
>> with the corresponding Dataflow job IDs to either the Dataflow Stack
>> Overflow channel or dataflow-feedb...@google.com?
>> I suspect Dataflow splits writing to multiple tables across multiple
>> workers, which may be keeping all workers busy, but we would have to look
>> at the job to confirm.
>> https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>> On Tue, Feb 27, 2018 at 11:56 PM Josh <jof...@gmail.com> wrote:
>>> Hi all,
>>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>>> seeing very poor performance in terms of the number of writes per second
>>> per worker.
>>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely,
>>> the number of workers remains constant and worker CPU utilisation stays
>>> at ~90% even when the input rate fluctuates down to below 10,000
>>> writes/sec. The job always keeps up with the stream (no backlog).
>>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being
>>> achieved with a single node, when streaming data into a *single* BQ
>>> table... So my theory is that writing to multiple tables is somehow causing
>>> the performance issue. Our writes are spread (unevenly) across 200+ tables.
>>> The job itself does very little processing, and looking at the Dataflow
>>> metrics pretty much all of the wall time is spent in the
>>> *StreamingWrite* step of BigQueryIO. The Beam version is 2.2.0.
>>> Our code looks like this:
>>> .to(new ToDestination())
>>> .withFormatFunction(new FormatForBigQuery())
>>> where ToDestination is a:
>>> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>>> which returns a:
>>> new TableDestination(tableName, "")
>>> where tableName looks like "myproject:dataset.tablename$20180228"
>>> Has anyone else seen this kind of poor performance when streaming writes
>>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>>> optimisations we can make?
>>> Thanks for any advice!
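For reference, a table spec string with a daily partition decorator like the
"myproject:dataset.tablename$20180228" example above can be built with a plain
date formatter. This is just a minimal sketch of the string format (the
`dailyPartitionSpec` helper and the project/dataset/table names are
placeholders, not part of the Beam API):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TableSpec {
    // Builds "project:dataset.table$yyyyMMdd". The "$yyyyMMdd" suffix is
    // BigQuery's partition decorator, which targets a specific daily
    // partition when streaming inserts into a date-partitioned table.
    static String dailyPartitionSpec(String project, String dataset,
                                     String table, LocalDate day) {
        // BASIC_ISO_DATE formats a LocalDate as "yyyyMMdd".
        return project + ":" + dataset + "." + table
                + "$" + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(dailyPartitionSpec(
                "myproject", "dataset", "tablename",
                LocalDate.of(2018, 2, 28)));
        // myproject:dataset.tablename$20180228
    }
}
```

A TableDestination built from such a spec (with an empty description, as in
the snippet above) is what the SerializableFunction passed to `.to(...)`
returns for each element.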