Thanks a lot for the explanation Eugene. I will try low values.

On Fri, Mar 9, 2018 at 7:03 AM, Eugene Kirpichov <[email protected]>
wrote:

> It's unfortunate that we have this parameter at all - we discussed various
> ways to get rid of it with +Reuven Lax <[email protected]> , ideally we'd
> be computing it automatically . In your case the throughput is quite modest
> and even a value of 1 should do well.
>
> Basically in this codepath we write the data to files in parallel, and
> every $triggeringFrequency we flush the files to a BigQuery load job. How
> many files to write in parallel, depends on the throughput. The fewer, the
> better, but the write throughput to a single file is limited. You can
> assume that write throughput to GCS is a few dozen MB/s per file; I assume
> 1000 events/s fits under that, depending on the event size.
>
> Actually with that in mind, we should probably just set the value to
> something like 10 or 100 which will be enough for most needs (up to about 5
> GB/s) but keep it configurable for people who need more, and eventually
> figure out a way to autoscale it.
>
> On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.47 to load data
>> into BQ from Dataflow using jobs (Write.Method.FILE_LOADS). Here is the
>> code:
>>
>>     val timePartitioning = new TimePartitioning().setField("
>> partition_day").setType("DAY")
>>
>>     BigQueryIO.write[Event]
>>       .to("some-table")
>>       .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED)
>>       .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
>>       .withMethod(Write.Method.FILE_LOADS)
>>       .withFormatFunction((input: Event) => BigQueryType[Event].
>> toTableRow(input))
>>       .withSchema(BigQueryType[Event].schema)
>>       .withTriggeringFrequency(Duration.standardMinutes(15))
>>       .withNumFileShards(XXX)
>>       .withTimePartitioning(timePartitioning)
>>
>> My question is related to the "numFileShards", which is a mandatory
>> parameter to set when using a "triggeringFrequency". I have been trying to
>> find information and reading the source code to understand what it does but
>> I couldn't find anything relevant.
>>
>> Considering there is gonna be a throughput of 300-1000 events per second,
>> what would be the recommended value?
>>
>> Thanks!
>>
>

Reply via email to