Thanks David, I bumped crunch.max.running.jobs to 10 and am now seeing job parallelism with MR. I tried the same with Spark and am still only seeing one job show up at a time.

Thanks,
Ben
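[For anyone following along, a minimal sketch of how that setting can be applied. The property name comes from the thread itself; setting it on the pipeline's Hadoop Configuration is an assumption, and MyDriver is a placeholder class name, not from the original posts.]

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Allow Crunch to run up to 10 independent jobs concurrently
    // instead of the default one at a time.
    conf.setInt("crunch.max.running.jobs", 10);
    // MyDriver is a placeholder for the job's actual driver class.
    Pipeline pipeline = new MRPipeline(MyDriver.class, conf);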
> On Jul 18, 2016, at 11:08 AM, David Ortiz <[email protected]> wrote:
>
> Sorry, meant with MR. It may be more helpful to try to fix the issue there,
> then see whether it carries over to Spark, since we are not sure if we
> expect that to work at all.
>
> From: Ben Juhn [mailto:[email protected]]
> Sent: Monday, July 18, 2016 2:05 PM
> To: [email protected]
> Subject: Re: Processing many map only collections in single pipeline with spark
>
> It's doing the same thing. One job shows up in the Spark UI at a time.
>
> Thanks,
> Ben
>
> On Jul 16, 2016, at 7:29 PM, David Ortiz <[email protected]> wrote:
>
> Hmm. Just out of curiosity, what if you do Pipeline.read in place of
> readTextFile?
>
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[email protected]> wrote:
>
> Nope, it queues up the jobs in series there too.
>
> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[email protected]> wrote:
>
> Just out of curiosity, if you use MRPipeline does it run in parallel? If so,
> the issue may be in Spark, since I believe Crunch leaves it to Spark to
> choose the best method of execution.
>
> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:
>
> Hey David,
>
> I have 100 active executors; each job typically only uses a few. It's
> running on YARN.
>
> Thanks,
> Ben
>
> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>
> What are the cluster resources available vs. what a single map uses?
>
> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>
> I enabled FAIR scheduling hoping that would help, but only one job shows up
> at a time.
>
> Thanks,
> Ben
>
> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>
> Each input is of a different format, and the DoFn implementation handles
> them depending on instantiation parameters.
>
> Thanks,
> Ben
>
> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>
> Instead of using readTextFile on the pipeline, try using the read method
> with a TextFileSource, which can accept a collection of paths.
>
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>
> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>
> Hello,
>
> I have a job configured the following way:
>
> for (String path : paths) {
>     PCollection<String> col = pipeline.readTextFile(path);
>     col.parallelDo(new MyDoFn(path), Writables.strings())
>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
> }
> pipeline.done();
>
> It results in one Spark job for each path, and the jobs run in sequence even
> though there are no dependencies. Is it possible to have the jobs run in
> parallel?
>
> Thanks,
> Ben
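[For reference, a rough, untested sketch of Stephen's suggestion above. It assumes the TextFileSource constructor overload that takes a list of paths plus a PType; the paths variable and pipeline are from Ben's original snippet.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.text.TextFileSource;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.fs.Path;

    // Build one source over every input path instead of one source per path.
    List<Path> inputs = new ArrayList<Path>();
    for (String path : paths) {
        inputs.add(new Path(path));
    }

    // A single read yields a single PCollection backed by all of the paths,
    // so the planner sees one input rather than one job's worth per path.
    PCollection<String> all = pipeline.read(
        new TextFileSource<String>(inputs, Writables.strings()));

[Note that this merges all inputs into one PCollection, so the per-path parameterization of MyDoFn that Ben describes earlier in the thread would need to be handled inside the DoFn itself, e.g. by inspecting the record rather than the source path.]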
