Do you mean the value to specify for the number of shards to write [1]?

For this, I think it's better not to specify any value, which gives the
runner the most flexibility in choosing the number of shards.
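
For a sense of scale with the numbers from the quoted message (around 2GB
per 1-minute window), here is a rough sketch of the average file size for a
few shard counts. It assumes records are spread evenly across shards and
that 1 GB is 1024^3 bytes. The class and method names are made up for
illustration and are not part of Beam.

```java
public class ShardSizing {
    // Assumption: 1 GB is taken as 1024^3 bytes.
    static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    /**
     * Average bytes per output file when one window's data is split
     * evenly across numShards files (integer division, rounds down).
     */
    static long averageBytesPerShard(long bytesPerWindow, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return bytesPerWindow / numShards;
    }

    public static void main(String[] args) {
        // Roughly 2GB per 1-minute window, as stated in the question.
        long bytesPerWindow = 2 * BYTES_PER_GB;
        // Candidate shard counts; 64 and 1000 are the values mentioned in the thread.
        for (int numShards : new int[] {16, 64, 256, 1000}) {
            long avg = averageBytesPerShard(bytesPerWindow, numShards);
            System.out.printf("%4d shards -> ~%.1f MiB per file%n",
                    numShards, avg / (1024.0 * 1024.0));
        }
    }
}
```

Under these assumptions, 64 shards works out to about 32 MiB per file per
window, and 1000 shards to about 2 MiB per file. This is only arithmetic on
file sizes; it says nothing by itself about which setting gives better write
throughput.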

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1455

On Wed, Sep 4, 2019 at 2:42 AM Ziyad Muhammed <[email protected]> wrote:

> Hi all
>
> I have a Beam pipeline running on Cloud Dataflow that produces Avro
> files on GCS. The window duration is 1 minute and the job is currently
> running with 64 cores (16 * n1-standard-4). The data produced per minute
> is around 2GB.
>
> Is there any recommendation on the number of Avro files to specify?
> Currently I'm using 64 (to match the number of cores). Will a very
> high number help increase the write throughput?
> I saw that BigQueryIO with FILE_LOADS uses a default value of 1000
> files.
>
> I tried some random values, but couldn't infer a pattern for when it is
> more performant.
>
> Any suggestion is hugely appreciated.
>
> Best
> Ziyad
>
