Could be a DataflowRunner-specific issue. Would you mind reporting this, with the corresponding Dataflow job IDs, to either the Dataflow Stack Overflow channel [1] or [email protected]?
I suspect Dataflow splits writing to multiple tables across multiple workers, which may be keeping all workers busy, but we have to look at the job to confirm.

Thanks,
Cham

[1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow

On Tue, Feb 27, 2018 at 11:56 PM Josh <[email protected]> wrote:

> Hi all,
>
> We are using BigQueryIO.write() to stream data into BigQuery, and are
> seeing very poor performance in terms of the number of writes per second per
> worker.
>
> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely, the
> number of workers and worker CPU utilisation remain constant at ~90% even
> when the rate of input fluctuates down to below 10,000 writes/sec. The job
> always keeps up with the stream (no backlog).
>
> I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved
> with a single node when streaming data into a *single* BQ table... So my
> theory is that writing to multiple tables is somehow causing the
> performance issue. Our writes are spread (unevenly) across 200+ tables. The
> job itself does very little processing, and looking at the Dataflow metrics,
> pretty much all of the wall time is spent in the *StreamingWrite* step of
> BigQueryIO. The Beam version is 2.2.0.
>
> Our code looks like this:
>
> stream.apply(BigQueryIO.<MyElement>write()
>     .to(new ToDestination())
>     .withFormatFunction(new FormatForBigQuery())
>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>
> where ToDestination is a:
>
> SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>
>
> which returns a:
>
> new TableDestination(tableName, "")
>
> where tableName looks like "myproject:dataset.tablename$20180228"
>
> Has anyone else seen this kind of poor performance when streaming writes
> to multiple BQ tables? Is there anything here that sounds wrong, or any
> optimisations we can make?
>
> Thanks for any advice!
>
> Josh
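For reference, here is a minimal sketch of what the ToDestination and FormatForBigQuery functions described in the quoted message could look like. MyElement, its fields, the row schema, and the project/dataset/table naming below are assumptions for illustration only, not Josh's actual code; only the function signatures and the TableDestination with a partition decorator come from the thread.

import java.io.Serializable;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Hypothetical element type; the real MyElement is not shown in the thread.
class MyElement implements Serializable {
  String tableName;  // logical table this element should be routed to
  String day;        // partition day, e.g. "20180228"
  String payload;

  MyElement(String tableName, String day, String payload) {
    this.tableName = tableName;
    this.day = day;
    this.payload = payload;
  }
}

// Routes each element to a per-table destination with a daily partition
// decorator, e.g. "myproject:dataset.tablename$20180228".
class ToDestination
    implements SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination> {
  @Override
  public TableDestination apply(ValueInSingleWindow<MyElement> input) {
    MyElement e = input.getValue();
    String tableSpec = "myproject:dataset." + e.tableName + "$" + e.day;
    return new TableDestination(tableSpec, "");  // empty table description
  }
}

// Converts an element into the TableRow that BigQueryIO streams to BigQuery.
class FormatForBigQuery implements SerializableFunction<MyElement, TableRow> {
  @Override
  public TableRow apply(MyElement e) {
    return new TableRow().set("payload", e.payload);  // assumed single-column schema
  }
}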
