Hi Augusto,

In Beam there is no way to specify how parallel a specific transform should be. There is only a general, pipeline-wide setting for parallelism, e.g. Dataflow has "numWorkers", Spark/Flink have "parallelism".

You should see 16 parallel operations for your Read if you configure a parallelism of 16.
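
To illustrate what a pipeline-wide parallelism of 16 means for a list of paths, here is a minimal plain-Python sketch. It is not Beam or Spark code and only mimics the idea: the input is split into as many bundles as the configured parallelism, and each bundle is processed by its own worker. The names split_into_bundles, fake_fetch, process_bundle and run are hypothetical, and fake_fetch merely stands in for the real S3 download.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_bundles(items, parallelism):
    """Distribute items round-robin over `parallelism` bundles."""
    bundles = [[] for _ in range(parallelism)]
    for index, item in enumerate(items):
        bundles[index % parallelism].append(item)
    return bundles


def fake_fetch(path):
    """Hypothetical stand-in for downloading one S3 object."""
    return len(path)


def process_bundle(bundle):
    """Process every path of one bundle sequentially."""
    return [fake_fetch(path) for path in bundle]


def run(paths, parallelism):
    """Process all bundles concurrently, one worker thread per bundle."""
    bundles = split_into_bundles(paths, parallelism)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(process_bundle, bundles))
    return bundles, results


if __name__ == "__main__":
    # Assumed example input; the real pipeline reads ~360k S3 paths.
    paths = [f"s3://example-bucket/file-{i}.txt" for i in range(1000)]
    bundles, results = run(paths, 16)
    print(len(bundles))                     # number of parallel bundles
    print(sum(len(r) for r in results))     # number of processed paths
```

In this sketch the only knob is the single parallelism argument, which mirrors the point above: the setting applies to the whole run rather than to one step.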

Cheers,
Max

The easiest way to influence the parallelism would be to write a custom Read operation.
On 27.03.19 13:05, augusto....@gmail.com wrote:
Hi all,

The input of my pipeline is a file (~360k lines) with S3 paths, read using TextIO.read(). I use
these S3 paths to retrieve the files in my next transform. I believe when I tried to run
it on a Spark cluster on EMR, even though the machine that I set up had 16 cores, the
"read from s3" task got split into only 8 parts. Since this operation is very
IO intensive I believe it would benefit from having a lot more parallelism. Is there a
way to define how parallel a certain operation should be (I believe this option exists
for Spark at least), or is this the wrong way to go about it?

Best regards,
Augusto
