Instead of using readTextFile on the pipeline, try using the read method with a TextFileSource, which can accept a collection of paths.
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java

On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:

Hello,

I have a job configured the following way:

    for (String path : paths) {
        PCollection<String> col = pipeline.readTextFile(path);
        col.parallelDo(new MyDoFn(path), Writables.strings())
           .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
    }
    pipeline.done();

It results in one Spark job for each path, and the jobs run in sequence even though there are no dependencies. Is it possible to have the jobs run in parallel?

Thanks,
Ben
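As a rough sketch of the suggestion above (assuming the `pipeline` and `paths` variables from the quoted code, and the TextFileSource constructor taking a list of paths plus a PType, per the linked source):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.text.TextFileSource;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.fs.Path;

    // Build a single source over all input paths so Crunch can plan one
    // job, instead of calling readTextFile once per path.
    List<Path> inputs = new ArrayList<>();
    for (String path : paths) {
        inputs.add(new Path(path));
    }
    PCollection<String> lines =
        pipeline.read(new TextFileSource<String>(inputs, Writables.strings()));

Note that this merges all files into one PCollection, so the per-path MyDoFn argument and per-path output directories from the quoted code would need a different mechanism (the sketch only shows the combined read).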
