Re: GCS numShards doubt

2020-03-02 Thread Kenneth Knowles
For bounded data, each bundle becomes a file:
https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356

Kenn

On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver  wrote:

> As Luke and Robert indicated, unsetting num shards _may_ cause the runner
> to optimize it automatically.
>
> For example, the Flink [1] and Dataflow [2] runners override num shards.
>
> However, in the Spark runner, I don't see any such override. So I have two
> questions:
> 1. Does the Spark runner override num shards somehow?
> 2. How is num shards determined if it's set to 0 and not overridden by the
> runner?
>
> [1]
> https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
> [2]
> https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866
>
> On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw 
> wrote:
>
>> To let Dataflow choose the optimal number shards and maximize
>> performance, it's often significantly better to simply leave it
>> unspecified. A higher numShards only helps if you have at least that
>> many workers.
>>
>> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya 
>> wrote:
>> >
>> > hi folks, I have this in code
>> >
>> > globalIndexJson.apply("GCSOutput",
>> TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>> >
>> > the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
>> changing numShards for larger datasize will write to GCS faster?
>>
>


Re: GCS numShards doubt

2020-03-02 Thread Kyle Weaver
As Luke and Robert indicated, unsetting num shards _may_ cause the runner
to optimize it automatically.

For example, the Flink [1] and Dataflow [2] runners override num shards.

However, in the Spark runner, I don't see any such override. So I have two
questions:
1. Does the Spark runner override num shards somehow?
2. How is num shards determined if it's set to 0 and not overridden by the
runner?

[1]
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
[2]
https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866

On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw 
wrote:

> To let Dataflow choose the optimal number shards and maximize
> performance, it's often significantly better to simply leave it
> unspecified. A higher numShards only helps if you have at least that
> many workers.
>
> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya 
> wrote:
> >
> > hi folks, I have this in code
> >
> > globalIndexJson.apply("GCSOutput",
> TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
> >
> > the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
> changing numShards for larger datasize will write to GCS faster?
>


Re: GCS numShards doubt

2020-02-14 Thread Robert Bradshaw
To let Dataflow choose the optimal number shards and maximize
performance, it's often significantly better to simply leave it
unspecified. A higher numShards only helps if you have at least that
many workers.

On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya  wrote:
>
> hi folks, I have this in code
>
> globalIndexJson.apply("GCSOutput", 
> TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>
> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if 
> changing numShards for larger datasize will write to GCS faster?


Re: GCS numShards doubt

2020-02-14 Thread Luke Cwik
Prefer to never specify num shards since this allows the runner the
greatest flexibility in how it executes and is the most performant as well.

Increasing num shards enables more workers to do the work in parallel but
there is no guarantee that it will be significantly faster since you could
have 5 workers.

On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya 
wrote:

> hi folks, I have this in code
>
> *globalIndexJson.apply("GCSOutput",
> TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));*
>
> the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
> changing numShards for larger datasize will write to GCS faster?
>