As Luke and Robert indicated, leaving numShards unset _may_ allow the runner
to choose it automatically.

For example, the Flink [1] and Dataflow [2] runners override numShards.

However, I don't see any such override in the Spark runner, so I have two
questions:
1. Does the Spark runner override numShards somehow?
2. How is numShards determined if it's set to 0 and not overridden by the
runner?

[1]
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
[2]
https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866
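To make question 2 concrete, here is a minimal sketch of the kind of decision I mean. This is hypothetical illustration only, not actual Beam or runner code: the method name, the parallelism parameter, and the fallback heuristic are all made up for the example.

```java
// Hypothetical sketch (NOT actual Beam/runner code) of how a runner might
// resolve an effective shard count when the user leaves numShards at 0.
public class ShardSelection {

    // If the user set an explicit shard count, honor it; otherwise derive one
    // from the runner's parallelism (an illustrative heuristic, not what any
    // real runner necessarily does).
    static int effectiveNumShards(int userNumShards, int runnerParallelism) {
        if (userNumShards > 0) {
            return userNumShards; // explicit user setting wins
        }
        return Math.max(1, runnerParallelism); // runner-chosen default
    }

    public static void main(String[] args) {
        System.out.println(effectiveNumShards(500, 32)); // prints 500
        System.out.println(effectiveNumShards(0, 32));   // prints 32
    }
}
```

My question is essentially what the Spark runner's version of this resolution step looks like when nothing like it appears in its translator code.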

On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw <[email protected]>
wrote:

> To let Dataflow choose the optimal number of shards and maximize
> performance, it's often significantly better to simply leave it
> unspecified. A higher numShards only helps if you have at least that
> many workers.
>
> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya <[email protected]>
> wrote:
> >
> > hi folks, I have this in code
> >
> >     globalIndexJson.apply("GCSOutput",
> >         TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
> >
> > The same code is executed for 50 GB, 3 TB, and 5 TB of data. I want to
> > know: will increasing numShards make writes to GCS faster for the larger
> > data sizes?
>
