Re: Imposing parallelism

Maximilian Michels Fri, 29 Mar 2019 07:24:09 -0700

Hi Augusto,

In Beam there is no way to specify how parallel a specific transform should be. There is only a general indicator for how parallel a pipeline should be, e.g. Dataflow has "numWorkers", Spark/Flink have "parallelism".

You should see 16 parallel operations for your Read if you configure a parallelism of 16.


Cheers,
Max

The easiest way to influence the parallelism would be to write a custom Read operation.
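
To make the custom Read idea a bit more concrete, here is a minimal sketch of the splitting step only, in plain Java with no Beam dependency. PathSharder and shard are hypothetical names and not part of the Beam API. The assumption is that a custom bounded source would do something like this where it reports its splits to the runner (in the Java SDK that would be BoundedSource.split), and that the runner then schedules one task per shard; whether it really does so depends on the runner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Beam API: it only illustrates how a
// custom Read could divide the list of S3 paths into a chosen number of pieces.
final class PathSharder {
    private PathSharder() {}

    /**
     * Splits paths into at most `parallelism` contiguous shards whose sizes
     * differ by at most one element.
     */
    static List<List<String>> shard(List<String> paths, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        // Never create more shards than there are paths.
        int shardCount = Math.min(parallelism, paths.size());
        List<List<String>> shards = new ArrayList<>(shardCount);
        if (shardCount == 0) {
            return shards;
        }
        int base = paths.size() / shardCount;
        int remainder = paths.size() % shardCount;
        int start = 0;
        for (int i = 0; i < shardCount; i++) {
            // The first `remainder` shards take one extra path each.
            int size = base + (i < remainder ? 1 : 0);
            shards.add(new ArrayList<>(paths.subList(start, start + size)));
            start += size;
        }
        return shards;
    }

    public static void main(String[] args) {
        // Made-up paths, roughly the size of the input file in this thread.
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 360_000; i++) {
            paths.add("s3://example-bucket/file-" + i);
        }
        List<List<String>> shards = shard(paths, 16);
        System.out.println(shards.size() + " shards, first has "
                + shards.get(0).size() + " paths");
    }
}
```

With 360,000 paths and a target of 16, each shard holds 22,500 paths. The shards are contiguous so that each one is a cheap view over the original list; a round-robin assignment would work equally well if neighbouring paths tend to point at files of similar size.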

On 27.03.19 13:05, augusto....@gmail.com wrote:

Hi all,

The input of my pipeline is a file (~360k lines) with S3 paths, read using TextIO.read(). I use
these S3 paths to retrieve the files in my next transform. I believe that when I tried to run
it on a Spark cluster on EMR, even though the machine that I set up had 16 cores, the
"read from S3" task only got split into 8 parts. Since this operation is very
IO intensive, I believe it would benefit from having a lot more parallelism. Is there a
way to define how parallel a certain operation should be (I believe this option exists
for Spark at least), or is this the wrong way to go about it?

Best regards,
Augusto
