Hmm. Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[email protected]> wrote:

> Nope, it queues up the jobs in series there too.
>
> On Jul 16, 2016, at 6:01 PM, David Ortiz <[email protected]> wrote:
>
> *run in parallel
>
> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[email protected]> wrote:
>
>> Just out of curiosity, if you use MRPipeline does it run in parallel? If
>> so, the issue may be in Spark, since I believe Crunch leaves it to Spark
>> to choose the best method of execution.
>>
>> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[email protected]> wrote:
>>
>>> Hey David,
>>>
>>> I have 100 active executors, but each job typically only uses a few. It's
>>> running on YARN.
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jul 16, 2016, at 12:53 PM, David Ortiz <[email protected]> wrote:
>>>
>>> What are the cluster resources available vs. what a single map uses?
>>>
>>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:
>>>
>>>> I enabled FAIR scheduling hoping that would help, but only one job is
>>>> showing up at a time.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>>>>
>>>> Each input is in a different format, and the DoFn implementation
>>>> handles them depending on instantiation parameters.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>>>>
>>>> Instead of using readTextFile on the pipeline, try using the read
>>>> method with a TextFileSource, which can accept a collection of
>>>> paths.
>>>>
>>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>>>>
>>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a job configured the following way:
>>>>>
>>>>> for (String path : paths) {
>>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>>>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>>> }
>>>>> pipeline.done();
>>>>>
>>>>> It results in one Spark job for each path, and the jobs run in sequence
>>>>> even though there are no dependencies between them. Is it possible to
>>>>> have the jobs run in parallel?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ben
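[For reference, Stephen's suggestion might look roughly like the sketch below. It assumes the `TextFileSource` constructor that takes a `List<Path>` (visible in the linked source file) and a `pipeline`/`paths` already in scope as in Ben's original snippet; note that folding all inputs into one source means the per-path `MyDoFn(path)` parameterization no longer applies, which may not fit Ben's mixed-format inputs.]

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

// Sketch only: read all input paths through a single source so Crunch
// plans one read, instead of creating a separate job per path.
List<Path> inputs = new ArrayList<>();
for (String path : paths) {
    inputs.add(new Path(path));
}
PCollection<String> col =
    pipeline.read(new TextFileSource<>(inputs, Writables.strings()));
```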
