Hmm. Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[email protected]> wrote:

> Nope, it queues up the jobs in series there too.
>
> On Jul 16, 2016, at 6:01 PM, David Ortiz <[email protected]> wrote:
>
> *run in parallel
>
> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[email protected]> wrote:
>
>> Just out of curiosity, if you use MRPipeline does it run in parallel? If
>> so, the issue may be in Spark, since I believe Crunch leaves it to Spark
>> to choose the best method of execution.
>>
>> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:
>>
>>> Hey David,
>>>
>>> I have 100 active executors, but each job typically only uses a few. It's
>>> running on YARN.
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>>>
>>> What are the cluster resources available vs. what a single map uses?
>>>
>>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>>>
>>>> I enabled FAIR scheduling hoping that would help, but only one job is
>>>> showing up at a time.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>>>>
>>>> Each input is in a different format, and the DoFn implementation
>>>> handles them depending on instantiation parameters.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>>>>
>>>> Instead of using readTextFile on the pipeline, try using the read
>>>> method with a TextFileSource, which can accept a collection of
>>>> paths.
>>>>
>>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>>>>
>>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a job configured the following way:
>>>>>
>>>>> for (String path : paths) {
>>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>>>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>>> }
>>>>> pipeline.done();
>>>>>
>>>>> It results in one Spark job for each path, and the jobs run in sequence
>>>>> even though there are no dependencies between them. Is it possible to
>>>>> have the jobs run in parallel?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ben
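[For reference, Stephen's suggestion might look roughly like the sketch below. It assumes the `TextFileSource` constructor that takes a `List<Path>` (visible in the linked source file) and a `pipeline`/`paths` already in scope as in Ben's original snippet; note that folding all inputs into one source means the per-path `MyDoFn(path)` parameterization no longer applies, which may not fit Ben's mixed-format inputs.]

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

// Sketch only: read all input paths through a single source so Crunch
// plans one read, instead of creating a separate job per path.
List<Path> inputs = new ArrayList<>();
for (String path : paths) {
    inputs.add(new Path(path));
}
PCollection<String> col =
    pipeline.read(new TextFileSource<>(inputs, Writables.strings()));
```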
