Hi Augusto,
In Beam there is no way to specify how parallel a specific transform
should be. There is only a general indicator for how parallel a pipeline
should be, e.g. Dataflow has "numWorkers" and Spark/Flink have "parallelism".
You should see 16 parallel operations for your Read if you configure a
parallelism of 16.
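
A rough sketch of setting that pipeline-wide value in code, assuming the
Flink runner (beam-runners-flink on the classpath). The class name and the
input path are made up for illustration. On the Spark runner the
corresponding knob is, as far as I know, Spark's own
"spark.default.parallelism" setting passed at submit time, so treat that
part as an assumption to check against your setup.

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hypothetical class name, used only for this sketch.
public class ParallelismSketch {

  // Builds options with a pipeline-wide parallelism of 16.
  // The value set here overrides any --parallelism passed in args.
  static FlinkPipelineOptions buildOptions(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    options.setParallelism(16); // applies to the whole pipeline, not one transform
    return options;
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(buildOptions(args));

    // Hypothetical input location: a text file with one S3 path per line.
    pipeline.apply("ReadPaths", TextIO.read().from("/path/to/s3-paths.txt"));

    pipeline.run().waitUntilFinish();
  }
}
```

The same value can be given on the command line as --parallelism=16
instead of calling setParallelism in code.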
Cheers,
Max
The easiest way to influence the parallelism would be to write a custom
Read operation.
On 27.03.19 13:05, augusto....@gmail.com wrote:
Hi all,
The input of my pipeline is a file (~360k lines) of S3 paths, read with TextIO.read(). I use
these S3 paths to retrieve the files in my next transform. I believe that when I ran
it on a Spark cluster on EMR, even though the machine that I set up had 16 cores, the
"read from s3" task was split into only 8 parts. Since this operation is very
IO intensive, I believe it would benefit from a lot more parallelism. Is there a
way to define how parallel a certain operation should be (I believe this option exists
for Spark at least), or is this the wrong way to go about it?
Best regards,
Augusto