Re: Processing many map only collections in single pipeline with spark

David Ortiz Sat, 16 Jul 2016 12:54:06 -0700

What cluster resources are available, compared with what a single map task uses?

On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[email protected]> wrote:


> I enabled FAIR scheduling hoping that would help, but only one job is
> showing up at a time.
>
> Thanks,
> Ben
>
> On Jul 15, 2016, at 8:17 PM, Ben Juhn <[email protected]> wrote:
>
> Each input is of a different format, and the DoFn implementation handles
> them depending on instantiation parameters.
>
> Thanks,
> Ben
>
> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[email protected]> wrote:
>
> Instead of using readTextFile on the pipeline, try using the read method
> with a TextFileSource, which can accept a collection of paths.
>
>
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
>
>
>
>
> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[email protected]>
> wrote:
>
>> Hello,
>>
>> I have a job configured the following way:
>>
>> for (String path : paths) {
>>     PCollection<String> col = pipeline.readTextFile(path);
>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>> }
>> pipeline.done();
>>
>> It results in one Spark job for each path, and the jobs run in sequence even
>> though there are no dependencies between them. Is it possible to have the jobs
>> run in parallel?
>>
>> Thanks,
>>
>> Ben
>>
>>
>>
>
>
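
On the FAIR scheduling point above: Spark's scheduler only runs jobs from one application concurrently when they are submitted from different threads, so a loop that submits work from a single thread will still show one job at a time. The sketch below is plain Java with no Crunch or Spark dependency and only illustrates the submission pattern. The class ParallelSubmit and the methods processPath and runAll are hypothetical names, and processPath is a stand-in for whatever builds and runs the work for one path. Whether a Crunch pipeline can safely be driven this way is an assumption that this thread does not confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSubmit {

    // Hypothetical stand-in for "run the work for one input path".
    // In a real job the read/parallelDo/write steps from the original loop
    // would go here (an assumption); this version only computes the output path.
    static String processPath(String path) {
        return "out/" + path;
    }

    // Submits one task per path to a fixed-size thread pool, waits for all of
    // them, and returns the results in the same order as the input paths.
    static List<String> runAll(List<String> paths, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<String> task = () -> processPath(path);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task finishes
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting work; submitted tasks have completed
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = Arrays.asList("a.txt", "b.txt", "c.txt");
        System.out.println(runAll(paths, 3));
    }
}
```

The pool size caps how many paths are in flight at once, which ties back to the question of how much of the cluster a single map-only job uses.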
