It's unfortunate that we have this parameter at all - we discussed various ways to get rid of it with +Reuven Lax <[email protected]> , ideally we'd be computing it automatically . In your case the throughput is quite modest and even a value of 1 should do well.
Basically in this codepath we write the data to files in parallel, and every $triggeringFrequency we flush the files to a BigQuery load job. How many files to write in parallel, depends on the throughput. The fewer, the better, but the write throughput to a single file is limited. You can assume that write throughput to GCS is a few dozen MB/s per file; I assume 1000 events/s fits under that, depending on the event size. Actually with that in mind, we should probably just set the value to something like 10 or 100 which will be enough for most needs (up to about 5 GB/s) but keep it configurable for people who need more, and eventually figure out a way to autoscale it. On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado <[email protected]> wrote: > Hi, > > I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.47 to load data > into BQ from Dataflow using jobs (Write.Method.FILE_LOADS). Here is the > code: > > val timePartitioning = new > TimePartitioning().setField("partition_day").setType("DAY") > > BigQueryIO.write[Event] > .to("some-table") > .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED) > .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND) > .withMethod(Write.Method.FILE_LOADS) > .withFormatFunction((input: Event) => > BigQueryType[Event].toTableRow(input)) > .withSchema(BigQueryType[Event].schema) > .withTriggeringFrequency(Duration.standardMinutes(15)) > .withNumFileShards(XXX) > .withTimePartitioning(timePartitioning) > > My question is related to the "numFileShards", which is a mandatory > parameter to set when using a "triggeringFrequency". I have been trying to > find information and reading the source code to understand what it does but > I couldn't find anything relevant. > > Considering there is gonna be a throughput of 300-1000 events per second, > what would be the recommended value? > > Thanks! >
