Hi Max,

Thanks again for your answer. Can anyone point me to an example or 
documentation on how to develop a custom reader?

Is this (https://beam.apache.org/documentation/io/developing-io-java/) the best 
reference to look at? 

Best regards,
Augusto 

On 2019/03/29 14:23:42, Maximilian Michels <[email protected]> wrote: 
> Hi Augusto,
> 
> In Beam there is no way to specify how parallel a specific transform 
> should be. There is only a general indicator for how parallel a pipeline 
> should be, i.e. Dataflow has "numWorkers", Spark/Flink have "parallelism".
> 
> You should see 16 parallel operations for your Read if you configure a 
> parallelism of 16.
> 
> Cheers,
> Max
> 
> The easiest way to influence the parallelism would be to write a custom 
> Read operation.
> On 27.03.19 13:05, [email protected] wrote:
> > Hi all,
> > 
> > The input of my pipeline is a file (~360k lines) of S3 paths, which I read 
> > with TextIO.read(). I use these S3 paths to retrieve the files in my next 
> > transform. I believe that when I ran it on a Spark cluster on EMR, the 
> > "read from s3" task was split into only 8 parts, even though the machine 
> > I set up had 16 cores. Since this operation is very IO intensive, I 
> > believe it would benefit from a lot more parallelism. Is there a way to 
> > define how parallel a certain operation should be (I believe this option 
> > exists for Spark at least), or is this the wrong way to go about it?
> > 
> > Best regards,
> > Augusto
> > 
> 
