Instead of using readTextFile on the pipeline, try using the read method with a TextFileSource, which can accept a collection of paths.
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java

On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:

Hello,

I have a job configured the following way:

    for (String path : paths) {
        PCollection<String> col = pipeline.readTextFile(path);
        col.parallelDo(new MyDoFn(path), Writables.strings())
           .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
    }
    pipeline.done();

It results in one Spark job for each path, and the jobs run in sequence even though there are no dependencies. Is it possible to have the jobs run in parallel?

Thanks,
Ben
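As a rough sketch of the suggestion above (assuming the `pipeline` and `paths` variables from the quoted code, and the TextFileSource constructor taking a list of paths plus a PType, per the linked source):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.text.TextFileSource;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.fs.Path;

    // Build a single source over all input paths so Crunch can plan one
    // job, instead of calling readTextFile once per path.
    List<Path> inputs = new ArrayList<>();
    for (String path : paths) {
        inputs.add(new Path(path));
    }
    PCollection<String> lines =
        pipeline.read(new TextFileSource<String>(inputs, Writables.strings()));

Note that this merges all files into one PCollection, so the per-path MyDoFn argument and per-path output directories from the quoted code would need a different mechanism (the sketch only shows the combined read).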
