Hey David, I have 100 active executors; each job typically only uses a few. It’s running on YARN.
Thanks,
Ben

> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>
> What are the cluster resources available vs. what a single map uses?
>
> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>
> I enabled FAIR scheduling hoping that would help, but only one job is
> showing up at a time.
>
> Thanks,
> Ben
>
>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>>
>> Each input is of a different format, and the DoFn implementation handles
>> them depending on instantiation parameters.
>>
>> Thanks,
>> Ben
>>
>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>>>
>>> Instead of using readTextFile on the pipeline, try using the read method
>>> with a TextFileSource, which can accept a collection of paths.
>>>
>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>>>
>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I have a job configured the following way:
>>>
>>>     for (String path : paths) {
>>>         PCollection<String> col = pipeline.readTextFile(path);
>>>         col.parallelDo(new MyDoFn(path), Writables.strings())
>>>            .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>     }
>>>     pipeline.done();
>>>
>>> It results in one Spark job for each path, and the jobs run in sequence
>>> even though there are no dependencies. Is it possible to have the jobs
>>> run in parallel?
>>>
>>> Thanks,
>>> Ben
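
A minimal sketch of Stephen's suggestion, assuming the List<Path> constructor
visible in the TextFileSource source he linked; the inputs list and the
surrounding pipeline variable are illustrative, not from the thread:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.text.TextFileSource;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.fs.Path;

    // Feed every path through a single source so Crunch plans one read
    // over all inputs instead of one independent job per path.
    // Assumes TextFileSource(List<Path>, PType) per the linked source file.
    List<Path> inputs = new ArrayList<>();
    for (String path : paths) {
        inputs.add(new Path(path));
    }
    PCollection<String> col = pipeline.read(
        new TextFileSource<String>(inputs, Writables.strings()));

Note the trade-off Ben raises above: a single combined source no longer
carries the per-path MyDoFn(path) parameter, so it only fits inputs that can
share one DoFn. On the FAIR-scheduling point, Spark's fair scheduler only
interleaves jobs that are submitted concurrently from separate threads; a
driver that submits its jobs one at a time from a single thread will still
run them in sequence even with FAIR mode enabled.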
