Just out of curiosity, if you use MRPipeline does it run in parallel? If so, the issue may be in Spark, since I believe Crunch leaves it to Spark to handle the best method of execution.
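(Untested sketch of that swap, with MyDriver standing in for whatever class carries your job jar:)

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

// Same per-path loop as below, just planned by the MapReduce runtime
// instead of SparkPipeline, to see whether the serialized execution
// is Spark-specific.
Pipeline pipeline = new MRPipeline(MyDriver.class, new Configuration());
// ... same loop over paths and pipeline.done() as in your snippet.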
On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:

> Hey David,
>
> I have 100 active executors, but each job typically only uses a few. It's
> running on YARN.
>
> Thanks,
> Ben
>
> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>
> What are the cluster resources available vs. what a single map uses?
>
> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>
>> I enabled FAIR scheduling hoping that would help, but only one job is
>> showing up at a time.
>>
>> Thanks,
>> Ben
>>
>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>>
>> Each input is of a different format, and the DoFn implementation handles
>> them depending on instantiation parameters.
>>
>> Thanks,
>> Ben
>>
>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>>
>> Instead of using readTextFile on the pipeline, try using the read method
>> with a TextFileSource, which can accept a collection of paths.
>>
>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>>
>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have a job configured the following way:
>>>
>>> for (String path : paths) {
>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>> }
>>> pipeline.done();
>>>
>>> It results in one Spark job for each path, and the jobs run in sequence
>>> even though there are no dependencies. Is it possible to have the jobs
>>> run in parallel?
>>>
>>> Thanks,
>>>
>>> Ben
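A minimal sketch of Stephen's TextFileSource suggestion (untested; it assumes the List<Path> constructor in the file linked above, and a single DoFn covering all inputs, which the per-path MyDoFn(path) arguments may rule out):

import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

// Read every input through one source so Crunch plans a single job,
// instead of one job per readTextFile call.
List<Path> inputs = new ArrayList<>();
for (String path : paths) {
    inputs.add(new Path(path));
}
PCollection<String> all =
    pipeline.read(new TextFileSource<>(inputs, Writables.strings()));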
