Re: GCS numShards doubt
For bounded data, each bundle becomes a file:
https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356

Kenn

On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver wrote:
> As Luke and Robert indicated, unsetting num shards _may_ cause the runner
> to optimize it automatically.
>
> For example, the Flink [1] and Dataflow [2] runners override num shards.
>
> However, in the Spark runner, I don't see any such override. So I have two
> questions:
> 1. Does the Spark runner override num shards somehow?
> 2. How is num shards determined if it's set to 0 and not overridden by the
> runner?
>
> [1] https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
> [2] https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866
>
> On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw wrote:
>> To let Dataflow choose the optimal number of shards and maximize
>> performance, it's often significantly better to simply leave it
>> unspecified. A higher numShards only helps if you have at least that
>> many workers.
>>
>> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya wrote:
>>> hi folks, I have this in code
>>>
>>> globalIndexJson.apply("GCSOutput",
>>>     TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>>>
>>> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
>>> changing numShards for larger datasize will write to GCS faster?
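Kenn's point can be illustrated with a small plain-Java sketch (the naming and helper are hypothetical, not Beam's actual API): when numShards is left unset, WriteFiles simply gives each bundle its own output file, so the file count is whatever bundling the runner happened to choose.

```java
import java.util.*;

public class BundlePerFile {
    // Sketch of the unset-numShards path: one output file per bundle,
    // so the number of files tracks the runner's bundling decisions.
    public static List<String> fileNames(String prefix, List<List<String>> bundles) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < bundles.size(); i++) {
            names.add(prefix + "-bundle-" + i + ".txt"); // hypothetical naming scheme
        }
        return names;
    }

    public static void main(String[] args) {
        // Three bundles of uneven size -> exactly three output files.
        List<List<String>> bundles = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        System.out.println(fileNames("gs://bucket/output", bundles).size()); // 3
    }
}
```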
Re: GCS numShards doubt
As Luke and Robert indicated, unsetting num shards _may_ cause the runner
to optimize it automatically.

For example, the Flink [1] and Dataflow [2] runners override num shards.

However, in the Spark runner, I don't see any such override. So I have two
questions:
1. Does the Spark runner override num shards somehow?
2. How is num shards determined if it's set to 0 and not overridden by the
runner?

[1] https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
[2] https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866

On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw wrote:
> To let Dataflow choose the optimal number of shards and maximize
> performance, it's often significantly better to simply leave it
> unspecified. A higher numShards only helps if you have at least that
> many workers.
>
> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya wrote:
>> hi folks, I have this in code
>>
>> globalIndexJson.apply("GCSOutput",
>>     TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>>
>> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
>> changing numShards for larger datasize will write to GCS faster?
Re: GCS numShards doubt
To let Dataflow choose the optimal number of shards and maximize
performance, it's often significantly better to simply leave it
unspecified. A higher numShards only helps if you have at least that
many workers.

On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya wrote:
> hi folks, I have this in code
>
> globalIndexJson.apply("GCSOutput",
>     TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>
> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
> changing numShards for larger datasize will write to GCS faster?
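Robert's point can be modeled with a few lines of plain Java (a rough sketch with hypothetical names, not Beam code): a fixed numShards caps how many workers can write concurrently, while leaving it unset lets every worker write.

```java
public class WriteParallelism {
    // Rough model: with numShards fixed, at most numShards workers can
    // write concurrently; unset (modeled here as 0) lets all workers write.
    public static int effectiveWriters(int workers, int numShards) {
        return numShards <= 0 ? workers : Math.min(workers, numShards);
    }

    public static void main(String[] args) {
        System.out.println(effectiveWriters(1000, 500)); // 500 shards cap 1000 workers at 500
        System.out.println(effectiveWriters(1000, 0));   // unset: all 1000 workers can write
        System.out.println(effectiveWriters(5, 500));    // only 5 workers: 500 shards don't help
    }
}
```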
Re: GCS numShards doubt
Prefer to never specify num shards, since this allows the runner the
greatest flexibility in how it executes the write and is the most
performant as well. Increasing num shards enables more workers to write
in parallel, but there is no guarantee it will be significantly faster:
if you only have 5 workers, 500 shards won't help.

On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya wrote:
> hi folks, I have this in code
>
> globalIndexJson.apply("GCSOutput",
>     TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>
> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
> changing numShards for larger datasize will write to GCS faster?
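For what the withNumShards(500) call above actually produces: the Java SDK's default shard name template is "-SSSSS-of-NNNNN". A plain-Java sketch (approximating, not calling, Beam's DefaultFilenamePolicy; padding behavior is simplified here) of how the 500 output files would be named:

```java
public class ShardNames {
    // Approximates Beam's default shard template "-SSSSS-of-NNNNN":
    // each of the numShards files gets a zero-padded shard index.
    public static String shardFileName(String prefix, int shard, int numShards, String suffix) {
        return String.format("%s-%05d-of-%05d%s", prefix, shard, numShards, suffix);
    }

    public static void main(String[] args) {
        // With withNumShards(500) and suffix ".txt", files look like:
        System.out.println(shardFileName("gs://bucket/output", 0, 500, ".txt"));
        System.out.println(shardFileName("gs://bucket/output", 499, 500, ".txt"));
    }
}
```

Note the fixed count means every run produces exactly 500 files regardless of data size, which is why the same setting behaves so differently at 50GB versus 5TB.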