Hi Max,

Again, thanks for your answer. Is there anyone who can point me to an example or documentation on how to develop your own reader?
Is this (https://beam.apache.org/documentation/io/developing-io-java/) the best reference to look at?

Best regards,
Augusto

On 2019/03/29 14:23:42, Maximilian Michels <[email protected]> wrote:
> Hi Augusto,
>
> In Beam there is no way to specify how parallel a specific transform
> should be. There is only a general indicator for how parallel a pipeline
> should be, e.g. Dataflow has "numWorkers", Spark/Flink have "parallelism".
>
> You should see 16 parallel operations for your Read if you configure a
> parallelism of 16.
>
> The easiest way to influence the parallelism would be to write a custom
> Read operation.
>
> Cheers,
> Max
>
> On 27.03.19 13:05, [email protected] wrote:
> > Hi all,
> >
> > The input of my pipeline is a file (~360k lines) with S3 paths, read using
> > TextIO.read(). I use these S3 paths to retrieve the files in my next
> > transform. When I ran it on a Spark cluster on EMR, I believe the
> > "read from s3" task was split into only 8 parts, even though the machine
> > that I set up had 16 cores. Since this operation is very IO intensive, I
> > believe it would benefit from a lot more parallelism. Is there a way
> > to define how parallel a certain operation should be (I believe this option
> > exists for Spark at least), or is this the wrong way to go about it?
> >
> > Best regards,
> > Augusto
