Each input has a different format, and the DoFn implementation handles each one according to its instantiation parameters.
Thanks,
Ben

> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>
> Instead of using readTextFile on the pipeline, try using the read method and
> use the TextFileSource, which can accept in a collection of paths.
>
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>
> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>
> Hello,
>
> I have a job configured the following way:
>
>     for (String path : paths) {
>         PCollection<String> col = pipeline.readTextFile(path);
>         col.parallelDo(new MyDoFn(path), Writables.strings())
>            .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>     }
>     pipeline.done();
>
> It results in one spark job for each path, and the jobs run in sequence even
> though there are no dependencies. Is it possible to have the jobs run in
> parallel?
>
> Thanks,
> Ben
