Hi all,

The input of my pipeline is a file (~360k lines) of S3 paths, which I read 
with TextIO.read(). In the next transform I use these paths to retrieve the 
actual files from S3. When I ran this on a Spark cluster on EMR, the "read 
from S3" task only got split into 8 parts, even though the machine I set up 
has 16 cores. Since this operation is very IO intensive, I believe it would 
benefit from a lot more parallelism. Is there a way to define how parallel a 
certain operation should be (I believe this option exists for Spark at 
least), or is this the wrong way to go about it?
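For reference, here is a minimal sketch of what the pipeline looks like. The 
FetchFileFn is a stand-in for my actual transform, and the bucket path is 
made up:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadS3Paths {
      // Stand-in for my real transform: fetches the object at each S3 path.
      static class FetchFileFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String path,
                                   OutputReceiver<String> out) {
          // ... fetch the file at `path` from S3 and emit its contents ...
          out.output(path);
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // The ~360k-line file of S3 paths; this is the read that only
        // got split into 8 parts on the Spark runner.
        PCollection<String> paths =
            p.apply("ReadPathsFile",
                    TextIO.read().from("s3://my-bucket/paths.txt"));

        // The IO-intensive step I'd like to run with more parallelism.
        paths.apply("FetchFromS3", ParDo.of(new FetchFileFn()));

        p.run().waitUntilFinish();
      }
    }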

Best regards,
Augusto 
