As Luke and Robert indicated, unsetting num shards _may_ cause the runner to optimize it automatically.
For example, the Flink [1] and Dataflow [2] runners override num shards. However, in the Spark runner, I don't see any such override. So I have two questions:

1. Does the Spark runner override num shards somehow?
2. How is num shards determined if it's set to 0 and not overridden by the runner?

[1] https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
[2] https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866

On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw <[email protected]> wrote:
> To let Dataflow choose the optimal number shards and maximize
> performance, it's often significantly better to simply leave it
> unspecified. A higher numShards only helps if you have at least that
> many workers.
>
> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya <[email protected]>
> wrote:
> >
> > hi folks, I have this in code
> >
> > globalIndexJson.apply("GCSOutput",
> > TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
> >
> > the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
> > changing numShards for larger datasize will write to GCS faster?
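For reference, a minimal sketch of the two write configurations under discussion (the `lines` PCollection and `outputPath` here are placeholder names, not taken from the original pipeline):

```java
// Fixed sharding: forces exactly 500 output files regardless of data
// size, which constrains write parallelism if the runner could use more.
lines.apply("GCSOutput",
    TextIO.write().to(outputPath).withSuffix(".txt").withNumShards(500));

// Runner-determined sharding: withNumShards(0), or simply omitting the
// call, leaves the shard count unset so runners like Flink and Dataflow
// can pick it based on available parallelism.
lines.apply("GCSOutput",
    TextIO.write().to(outputPath).withSuffix(".txt"));
```

Per the TextIO javadoc, constraining the number of shards is likely to reduce pipeline performance, which matches Robert's suggestion below to leave it unspecified.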
