Thanks David, I bumped crunch.max.running.jobs to 10 and am now seeing job parallelism with MR. I tried the same with Spark and am still only seeing one job show up at a time.

Thanks,
Ben
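[For anyone following along, a minimal sketch of how that setting can be applied. The property name comes from the thread itself; setting it on the pipeline's Hadoop Configuration is an assumption, and MyDriver is a placeholder class name, not from the original posts.]

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Allow Crunch to run up to 10 independent jobs concurrently
    // instead of the default one at a time.
    conf.setInt("crunch.max.running.jobs", 10);
    // MyDriver is a placeholder for the job's actual driver class.
    Pipeline pipeline = new MRPipeline(MyDriver.class, conf);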
> On Jul 18, 2016, at 11:08 AM, David Ortiz <[email protected]> wrote:
>
> Sorry, meant with MR. It may be more helpful to try to fix the issue there,
> then see whether it carries over to Spark, since we are not sure if we
> expect that to work at all.
>
> From: Ben Juhn [mailto:[email protected]]
> Sent: Monday, July 18, 2016 2:05 PM
> To: [email protected]
> Subject: Re: Processing many map only collections in single pipeline with spark
>
> It's doing the same thing. One job shows up in the Spark UI at a time.
>
> Thanks,
> Ben
>
> On Jul 16, 2016, at 7:29 PM, David Ortiz <[email protected]> wrote:
>
> Hmm. Just out of curiosity, what if you do Pipeline.read in place of
> readTextFile?
>
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[email protected]> wrote:
>
> Nope, it queues up the jobs in series there too.
>
> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[email protected]> wrote:
>
> Just out of curiosity, if you use MRPipeline does it run in parallel? If so,
> the issue may be in Spark, since I believe Crunch leaves it to Spark to
> choose the best method of execution.
>
> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:
>
> Hey David,
>
> I have 100 active executors; each job typically only uses a few. It's
> running on YARN.
>
> Thanks,
> Ben
>
> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>
> What are the cluster resources available vs. what a single map uses?
>
> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>
> I enabled FAIR scheduling hoping that would help, but only one job shows up
> at a time.
>
> Thanks,
> Ben
>
> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>
> Each input is of a different format, and the DoFn implementation handles
> them depending on instantiation parameters.
>
> Thanks,
> Ben
>
> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>
> Instead of using readTextFile on the pipeline, try using the read method
> with a TextFileSource, which can accept a collection of paths.
>
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>
> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>
> Hello,
>
> I have a job configured the following way:
>
> for (String path : paths) {
>     PCollection<String> col = pipeline.readTextFile(path);
>     col.parallelDo(new MyDoFn(path), Writables.strings())
>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
> }
> pipeline.done();
>
> It results in one Spark job for each path, and the jobs run in sequence even
> though there are no dependencies. Is it possible to have the jobs run in
> parallel?
>
> Thanks,
> Ben
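[For reference, a rough, untested sketch of Stephen's suggestion above. It assumes the TextFileSource constructor overload that takes a list of paths plus a PType; the paths variable and pipeline are from Ben's original snippet.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.text.TextFileSource;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.fs.Path;

    // Build one source over every input path instead of one source per path.
    List<Path> inputs = new ArrayList<Path>();
    for (String path : paths) {
        inputs.add(new Path(path));
    }

    // A single read yields a single PCollection backed by all of the paths,
    // so the planner sees one input rather than one job's worth per path.
    PCollection<String> all = pipeline.read(
        new TextFileSource<String>(inputs, Writables.strings()));

[Note that this merges all inputs into one PCollection, so the per-path parameterization of MyDoFn that Ben describes earlier in the thread would need to be handled inside the DoFn itself, e.g. by inspecting the record rather than the source path.]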
