To maximize performance, it's often significantly better to leave numShards unspecified (i.e. drop the withNumShards(...) call) and let Dataflow choose the optimal number of shards. A higher numShards only helps if you have at least that many workers.
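
As a rough mental model of that last point, here is a toy sketch. It is my own simplification, not how Dataflow actually schedules writes: it assumes each shard is written by one worker at a time, so the number of concurrent writers is capped by whichever of the two counts is smaller. The class and method names are made up for illustration.

```java
public class ShardSketch {
    // Toy model (an assumption, not Dataflow's real scheduler): one worker
    // writes one shard at a time, so concurrency is min(shards, workers).
    static int effectiveWriters(int numShards, int workers) {
        if (numShards <= 0 || workers <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.min(numShards, workers);
    }

    public static void main(String[] args) {
        int workers = 100; // hypothetical worker count
        for (int numShards : new int[] {50, 100, 500}) {
            System.out.println(numShards + " shards, " + workers + " workers: "
                + effectiveWriters(numShards, workers) + " concurrent writers");
        }
    }
}
```

Under this model, going from 100 to 500 shards with 100 workers buys no extra write parallelism, while 50 shards would leave half the workers idle during the write.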

On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya <vivek....@gmail.com> wrote:
>
> hi folks, I have this in code
>
> globalIndexJson.apply("GCSOutput",
> TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>
> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
> changing numShards for larger datasize will write to GCS faster?