Hi all,

The input of my pipeline is a file (~360k lines) of S3 paths, which I read 
with TextIO.read(). In the next transform I use these paths to retrieve the 
actual files from S3. When I ran this on a Spark cluster on EMR, the "read 
from S3" task only got split into 8 parts, even though the machine I set up 
has 16 cores. Since this operation is very IO intensive, I believe it would 
benefit from a lot more parallelism. Is there a way to define how parallel a 
certain operation should be (I believe this option exists for Spark at 
least), or is this the wrong way to go about it?
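For reference, here is a minimal sketch of what the pipeline looks like. The 
FetchFileFn is a stand-in for my actual transform, and the bucket path is 
made up:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadS3Paths {
      // Stand-in for my real transform: fetches the object at each S3 path.
      static class FetchFileFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String path,
                                   OutputReceiver<String> out) {
          // ... fetch the file at `path` from S3 and emit its contents ...
          out.output(path);
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // The ~360k-line file of S3 paths; this is the read that only
        // got split into 8 parts on the Spark runner.
        PCollection<String> paths =
            p.apply("ReadPathsFile",
                    TextIO.read().from("s3://my-bucket/paths.txt"));

        // The IO-intensive step I'd like to run with more parallelism.
        paths.apply("FetchFromS3", ParDo.of(new FetchFileFn()));

        p.run().waitUntilFinish();
      }
    }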

Best regards,
Augusto 
