Sorry, I meant with MRPipeline.  It may be more helpful to try to fix the issue 
there first, then see whether it carries over to Spark, since we are not sure we 
expect that to work at all.

From: Ben Juhn [mailto:[email protected]]
Sent: Monday, July 18, 2016 2:05 PM
To: [email protected]
Subject: Re: Processing many map only collections in single pipeline with spark

It’s doing the same thing.  One job shows up in the Spark UI at a time.

Thanks,
Ben
On Jul 16, 2016, at 7:29 PM, David Ortiz <[email protected]> wrote:

Hmm.  Just out of curiosity, what if you do Pipeline.read in place of 
readTextFile?

On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[email protected]> wrote:
Nope, it queues up the jobs in series there too.

On Jul 16, 2016, at 6:01 PM, David Ortiz <[email protected]> wrote:

*run in parallel

On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[email protected]> wrote:
Just out of curiosity, if you use MRPipeline does it run in parallel?  If so, 
the issue may be in Spark, since I believe Crunch leaves it to Spark to choose 
the best method of execution.

On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:
Hey David,

I have 100 active executors, and each job typically uses only a few.  It’s 
running on YARN.

Thanks,
Ben

On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:

What are the cluster resources available vs what a single map uses?

On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
I enabled FAIR scheduling hoping that would help, but only one job shows up at 
a time.

Thanks,
Ben
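
For reference, enabling the FAIR scheduler is a one-line configuration change (a sketch, assuming the setting lives in spark-defaults.conf). Note that FAIR mode only changes how resources are shared between jobs that are already running concurrently; in Spark, concurrency itself requires the jobs' actions to be submitted from separate threads.

```
spark.scheduler.mode  FAIR
```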

On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:

Each input is of a different format, and the DoFn implementation handles them 
depending on instantiation parameters.

Thanks,
Ben

On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:

Instead of using readTextFile on the pipeline, try using the read method with a 
TextFileSource, which can accept a collection of paths.

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
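
A sketch of that suggestion, assuming the TextFileSource constructor that takes a List<Path>; note it merges every input into one PCollection, which may not fit this case since each path here needs its own MyDoFn parameters:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.crunch.PCollection;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;

// One source covering every input path, so Crunch plans a single job
List<Path> inputs = new ArrayList<>();
for (String path : paths) {
    inputs.add(new Path(path));
}
PCollection<String> lines = pipeline.read(
    new TextFileSource(inputs, Writables.strings()));
```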



On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
Hello,

I have a job configured the following way:

for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings())
       .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
pipeline.done();

It results in one Spark job for each path, and the jobs run in sequence even 
though there are no dependencies.  Is it possible to have the jobs run in 
parallel?

Thanks,

Ben
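
For what it's worth, plain Spark only runs independent jobs concurrently when their actions are launched from separate driver threads; whether Crunch's SparkPipeline can be driven that way is exactly the open question in this thread. The generic submit-from-threads pattern, shown here in pure Java with a hypothetical runJob standing in for a per-path Spark action, looks like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelJobs {
    // Stand-in for one independent "job" (in Spark this would be an
    // action such as a write); only the submission pattern matters here.
    static String runJob(String path) {
        return "out/" + path;
    }

    // Submit every job to its own thread, then wait for all of them.
    public static List<String> runAll(List<String> paths) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, paths.size()));
        List<Future<String>> futures = paths.stream()
            .map(p -> pool.submit(() -> runJob(p)))
            .collect(Collectors.toList());
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());  // blocks until that job finishes
        }
        pool.shutdown();
        return results;
    }
}
```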






