The TL;DR is that Spark doesn't really have a proper multiple-outputs model
à la Crunch/MR: jobs aren't kicked off until you do some sort of write, and
as soon as you do a write, Spark myopically executes all of the code that
needs to happen in order for that write to be completed. You need to be
fairly clever about sequencing your writes and doing intermediate caching
to make sure your pipeline executes efficiently.
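
Roughly, the pattern I mean looks like the following untested sketch, which
reuses MyDoFn and paths from your original snippet (the master URL and app
name are placeholders):

```java
// Untested sketch of the cache-before-write pattern on a SparkPipeline.
// MyDoFn and paths come from the original job; "yarn" and the app name
// are placeholders.
Pipeline pipeline = new SparkPipeline("yarn", "multi-output-job");
for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    // cache() marks the transformed collection for reuse, so a write
    // doesn't force Spark to recompute it from the source.
    PCollection<String> out =
        col.parallelDo(new MyDoFn(path), Writables.strings()).cache();
    out.write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
// Writes are only planned here; done() triggers the actual execution.
pipeline.done();
```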

On the Crunch/MR side, I'm a little surprised that we're only executing the
map-only jobs one at a time. I'm assuming you're not mucking with the
crunch.max.running.jobs parameter in some way? (See:
https://crunch.apache.org/user-guide.html#mrpipeline)
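
If you do want to bump it, it's an ordinary Hadoop Configuration property
that the pipeline picks up; a hypothetical example (the value 10 is purely
illustrative, and MyJobClass stands in for your driver class):

```java
// Sketch: allowing more independent MR jobs to run concurrently.
// The value 10 is illustrative, not a recommendation.
Configuration conf = new Configuration();
conf.setInt("crunch.max.running.jobs", 10);
Pipeline pipeline = new MRPipeline(MyJobClass.class, conf);
```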


J

On Sat, Jul 16, 2016 at 7:29 PM, David Ortiz <[email protected]> wrote:

> Hmm.  Just out of curiosity, what if you do Pipeline.read in place of
> readTextFile?
>
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[email protected]> wrote:
>
>> Nope, it queues up the jobs in series there too.
>>
>> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[email protected]> wrote:
>>
>>> Just out of curiosity, if you use MRPipeline does it run in parallel?
>>> If so, the issue may be in Spark, since I believe Crunch leaves it to
>>> Spark to handle the best method of execution.
>>>
>>> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:
>>>
>>>> Hey David,
>>>>
>>>> I have 100 active executors, each job typically only uses a few.  It’s
>>>> running on yarn.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>>>>
>>>> What are the cluster resources available vs what a single map uses?
>>>>
>>>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>>>>
>>>>> I enabled FAIR scheduling hoping that would help, but only one job is
>>>>> showing up at a time.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>>>>>
>>>>> Each input is of a different format, and the DoFn implementation
>>>>> handles them depending on instantiation parameters.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Instead of using readTextFile on the pipeline, try using the read
>>>>> method and use the TextFileSource, which can accept a collection of
>>>>> paths.
>>>>>
>>>>>
>>>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn"
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a job configured the following way:
>>>>>>
>>>>>> for (String path : paths) {
>>>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>>>>         .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>>>> }
>>>>>> pipeline.done();
>>>>>>
>>>>>> It results in one Spark job for each path, and the jobs run in sequence
>>>>>> even though there are no dependencies.  Is it possible to have the jobs
>>>>>> run in parallel?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
