Thanks a lot for the explanation Eugene. I will try low values.

On Fri, Mar 9, 2018 at 7:03 AM, Eugene Kirpichov <[email protected]> wrote:

> It's unfortunate that we have this parameter at all - we discussed various
> ways to get rid of it with +Reuven Lax <[email protected]>, ideally we'd
> be computing it automatically. In your case the throughput is quite modest
> and even a value of 1 should do well.
>
> Basically in this codepath we write the data to files in parallel, and
> every $triggeringFrequency we flush the files to a BigQuery load job. How
> many files to write in parallel, depends on the throughput. The fewer, the
> better, but the write throughput to a single file is limited. You can
> assume that write throughput to GCS is a few dozen MB/s per file; I assume
> 1000 events/s fits under that, depending on the event size.
>
> Actually with that in mind, we should probably just set the value to
> something like 10 or 100 which will be enough for most needs (up to about 5
> GB/s) but keep it configurable for people who need more, and eventually
> figure out a way to autoscale it.
>
> On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.47 to load data
>> into BQ from Dataflow using jobs (Write.Method.FILE_LOADS). Here is the
>> code:
>>
>> val timePartitioning = new TimePartitioning()
>>   .setField("partition_day")
>>   .setType("DAY")
>>
>> BigQueryIO.write[Event]
>>   .to("some-table")
>>   .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED)
>>   .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
>>   .withMethod(Write.Method.FILE_LOADS)
>>   .withFormatFunction((input: Event) => BigQueryType[Event].toTableRow(input))
>>   .withSchema(BigQueryType[Event].schema)
>>   .withTriggeringFrequency(Duration.standardMinutes(15))
>>   .withNumFileShards(XXX)
>>   .withTimePartitioning(timePartitioning)
>>
>> My question is related to the "numFileShards", which is a mandatory
>> parameter to set when using a "triggeringFrequency". I have been trying to
>> find information and reading the source code to understand what it does but
>> I couldn't find anything relevant.
>>
>> Considering there is gonna be a throughput of 300-1000 events per second,
>> what would be the recommended value?
>>
>> Thanks!
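
For anyone finding this thread later, the rule of thumb in Eugene's reply can be turned into a rough back-of-the-envelope estimate. This is only a sketch, not anything Beam provides: the helper name estimateNumFileShards is hypothetical, the 50 MB/s per-file figure is an assumption (it sits within "a few dozen MB/s" and matches 100 shards covering about 5 GB/s), and the 1 KB average event size is a made-up example value.

```scala
object ShardEstimate {
  // ASSUMPTION: sustained write throughput to a single GCS file, in bytes/s.
  // The thread only says "a few dozen MB/s per file"; 50 MB/s is a guess.
  val PerFileBytesPerSec: Double = 50e6

  /** Hypothetical helper (not part of Beam or Scio): a rough number of
    * parallel files needed to absorb the given event throughput.
    */
  def estimateNumFileShards(
      eventsPerSec: Double,
      avgEventBytes: Double,
      perFileBytesPerSec: Double = PerFileBytesPerSec
  ): Int = {
    // Total bytes per second the pipeline must write.
    val requiredBytesPerSec = eventsPerSec * avgEventBytes
    // Round up, and never go below one shard.
    math.max(1, math.ceil(requiredBytesPerSec / perFileBytesPerSec).toInt)
  }
}

// 1000 events/s at an assumed 1 KB per event is roughly 1 MB/s,
// far below the assumed per-file limit, so a single shard suffices.
val shards = ShardEstimate.estimateNumFileShards(1000, 1024)
```

Under these assumptions the estimate for 300-1000 events/s comes out as 1, which agrees with the advice above that a low value is enough; measure the real event size before relying on the number.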
