Hi all,

The input to my pipeline is a file (~360k lines) of S3 paths, which I read with TextIO.read(). A subsequent transform then uses these paths to retrieve the actual files from S3. When I ran the pipeline on a Spark cluster on EMR, the "read from S3" step was split into only 8 parts, even though the machine I set up has 16 cores. Since this operation is very IO-intensive, I believe it would benefit from much more parallelism. Is there a way to specify how parallel a given operation should be (I believe this option exists for Spark itself, at least), or is this the wrong way to go about it?
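
For reference, the relevant part of the pipeline looks roughly like this (a minimal sketch: the bucket/file name and FetchS3ObjectFn are placeholders standing in for my actual fetch transform):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class FetchFromS3Pipeline {

  // Placeholder for my real transform: downloads the object at each path.
  static class FetchS3ObjectFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String s3Path, OutputReceiver<String> out) {
      // ... fetch the object at s3Path (IO-bound) and emit its contents ...
      out.output(s3Path);
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // ~360k lines, one S3 path per line
    PCollection<String> paths =
        p.apply("ReadPathList", TextIO.read().from("s3://my-bucket/paths.txt"));

    // The IO-intensive step where I'd like more than 8 parallel tasks
    paths.apply("FetchObjects", ParDo.of(new FetchS3ObjectFn()));

    p.run().waitUntilFinish();
  }
}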
Best regards,
Augusto
