Re: Processing many map only collections in single pipeline with spark

Ben Juhn Fri, 15 Jul 2016 20:17:48 -0700

Each input is in a different format, and the DoFn implementation handles each 
one according to its instantiation parameters.
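
To illustrate what "depending on instantiation parameters" could look like, here 
is a minimal sketch in plain Java. It deliberately leaves out the Crunch DoFn 
base class so it stands alone; the class name FormatAwareFn, the Format enum, 
and the "first field" parsing rule are all hypothetical and are not the actual 
MyDoFn from the job.

```java
/** Hypothetical stand-in for a per-input function configured at construction time. */
class FormatAwareFn {
    /** Assumed input formats; the real job's formats are not stated in the thread. */
    enum Format { CSV, TSV }

    private final Format format;

    FormatAwareFn(Format format) {
        this.format = format;
    }

    /** Returns the first field of the line, split on the delimiter for this instance's format. */
    String process(String line) {
        String delimiter = (format == Format.CSV) ? "," : "\t";
        // Limit -1 keeps trailing empty fields, so the array always has at least one element.
        return line.split(delimiter, -1)[0];
    }
}
```

Because each instance carries its own format, reading every path into a single 
collection would lose the information that tells the function which parsing rule 
to apply.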


Thanks,
Ben

> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
> 
> Instead of using readTextFile on the pipeline, try using the read method and 
> use the TextFileSource, which can accept a collection of paths.
> 
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
> 
> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
> 
> Hello,
> 
> I have a job configured the following way:
> for (String path : paths) {
>     PCollection<String> col = pipeline.readTextFile(path);
>     col.parallelDo(new MyDoFn(path), Writables.strings())
>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
> }
> pipeline.done();
> This results in one Spark job per path, and the jobs run sequentially even 
> though they have no dependencies on one another. Is it possible to run them 
> in parallel?
> Thanks,
> Ben
>
