It's unfortunate that we have this parameter at all - we discussed various
ways to get rid of it with +Reuven Lax <[email protected]>; ideally we'd be
computing it automatically. In your case the throughput is quite modest,
and even a value of 1 should do well.

Basically, in this code path we write the data to files in parallel, and
every $triggeringFrequency we flush the files to a BigQuery load job. How
many files to write in parallel depends on the throughput: the fewer, the
better, but the write throughput to a single file is limited. You can
assume that write throughput to GCS is a few dozen MB/s per file; I assume
1000 events/s fits under that, depending on the event size.

Actually, with that in mind, we should probably just set the value to
something like 10 or 100, which would be enough for most needs (up to about
5 GB/s), but keep it configurable for people who need more, and eventually
figure out a way to autoscale it.
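To make the sizing above concrete, here is a back-of-envelope sketch in Scala. The object and method names are made up for illustration (nothing here is a Beam API), and the ~30 MB/s per-file figure is just the "few dozen MB/s" assumption from above:

```scala
// Rough sizing for numFileShards under the FILE_LOADS write method.
// Assumption: a single file can absorb roughly a few dozen MB/s of
// writes to GCS; we use 30 MB/s here. All names are hypothetical.
object ShardSizing {
  val BytesPerFilePerSec: Long = 30L * 1024 * 1024 // assumed per-file GCS throughput

  // Number of parallel file shards needed to absorb the incoming rate.
  def numFileShards(eventsPerSec: Long, avgEventBytes: Long): Int = {
    val bytesPerSec = eventsPerSec * avgEventBytes
    math.max(1, math.ceil(bytesPerSec.toDouble / BytesPerFilePerSec).toInt)
  }
}

// 1000 events/s at 1 KB each is ~1 MB/s - a single shard is plenty:
// ShardSizing.numFileShards(1000, 1024) == 1
```

By the same arithmetic, a value of 100 covers a few GB/s of sustained throughput, which is why a fixed default in the 10-100 range would serve most pipelines.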

On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado <[email protected]>
wrote:

> Hi,
>
> I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.47 to load data
> into BQ from Dataflow using jobs (Write.Method.FILE_LOADS). Here is the
> code:
>
>     val timePartitioning = new TimePartitioning()
>       .setField("partition_day")
>       .setType("DAY")
>
>     BigQueryIO.write[Event]
>       .to("some-table")
>       .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED)
>       .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
>       .withMethod(Write.Method.FILE_LOADS)
>       .withFormatFunction((input: Event) => BigQueryType[Event].toTableRow(input))
>       .withSchema(BigQueryType[Event].schema)
>       .withTriggeringFrequency(Duration.standardMinutes(15))
>       .withNumFileShards(XXX)
>       .withTimePartitioning(timePartitioning)
> My question is about "numFileShards", which is a mandatory parameter to
> set when using a "triggeringFrequency". I have been trying to find
> information and reading the source code to understand what it does, but I
> couldn't find anything relevant.
>
> Considering there is going to be a throughput of 300-1000 events per
> second, what would be the recommended value?
>
> Thanks!
>
