subject:"Union of RDDs without the overhead of Union"

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers

well the "hadoop" way is to save to a/b and a/c and read from a/* :)
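As a rough illustration of that layout, here is a minimal pure-Python sketch, not Spark code. It assumes each dataset is a plain list of partitions; the helper names `save_as_text_file` and `read_text_glob` are hypothetical stand-ins for what `saveAsTextFile` and a glob read such as `sc.textFile("a/*")` would do in Spark.

```python
import glob
import os
import tempfile


def save_as_text_file(partitions, out_dir):
    """Write one part-NNNNN file per partition (hypothetical helper, mimics a Hadoop-style output dir)."""
    os.makedirs(out_dir)
    for i, part in enumerate(partitions):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            for record in part:
                f.write(f"{record}\n")


def read_text_glob(pattern):
    """Read all lines from the part files of every directory matching the glob (hypothetical helper)."""
    lines = []
    for directory in sorted(glob.glob(pattern)):
        for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
            with open(path) as f:
                lines.extend(line.rstrip("\n") for line in f)
    return lines


root = tempfile.mkdtemp()
rdd1 = [["a1", "a2"], ["a3"]]   # stand-in for rdd1: two partitions
rdd2 = [["b1"], ["b2", "b3"]]   # stand-in for rdd2: two partitions

# Each dataset is saved on its own, with no union step in between.
save_as_text_file(rdd1, os.path.join(root, "a", "b"))
save_as_text_file(rdd2, os.path.join(root, "a", "c"))

# A later reader sees both outputs as one dataset via the glob a/*.
combined = read_text_glob(os.path.join(root, "a", "*"))
```

The two saves are independent jobs, and the combination only happens at read time.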

On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Spark users and developers,
>
> Does anyone know how to union two RDDs without the overhead of a union?
>
> Say rdd1.union(rdd2).saveAsTextFile(..)
> This requires a stage to union the 2 RDDs before saveAsTextFile (2
> stages). Is there a way to skip the union step but have the contents of the
> two RDDs saved to the same output text file?
>
> Thank you!
>
> Jerry
>

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers

I am surprised that union introduces a stage. UnionRDD should have only
narrow dependencies.
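A minimal pure-Python model of why a union is narrow, under the assumption that the union's partition list is simply the parents' partition lists laid end to end. The function name `union_partitions` is hypothetical and this is not Spark's implementation. On a real RDD, `toDebugString` can be used to inspect the lineage and check whether a shuffle boundary actually appears.

```python
def union_partitions(parent_sizes):
    """Map each partition of the union to the single (parent, partition) pair it reads from.

    parent_sizes: number of partitions in each parent, e.g. [3, 2].
    """
    mapping = []
    for parent, count in enumerate(parent_sizes):
        for partition in range(count):
            mapping.append((parent, partition))
    return mapping


# rdd1 with 3 partitions, rdd2 with 2 partitions (illustrative sizes).
mapping = union_partitions([3, 2])

# Every union partition depends on exactly one parent partition, so no data
# has to move between partitions and no extra stage is needed for the union.
for union_index, (parent, partition) in enumerate(mapping):
    print(f"union partition {union_index} <- parent {parent}, partition {partition}")
```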

On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:

> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>
> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> Does anyone know how to union two RDDs without the overhead of a union?
>>
>> Say rdd1.union(rdd2).saveAsTextFile(..)
>> This requires a stage to union the 2 RDDs before saveAsTextFile (2
>> stages). Is there a way to skip the union step but have the contents of the
>> two RDDs saved to the same output text file?
>>
>> Thank you!
>>
>> Jerry
>>
>
>

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra

I agree with Koert that UnionRDD should have only narrow dependencies.
However, a union of two RDDs does increase the number of tasks to be executed
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which produces fewer tasks and hence less overhead.
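To make the task-count difference concrete, here is a minimal pure-Python model, not Spark code. It assumes an RDD is a list of partitions and that one task runs per output partition; `union` and `zip_partitions` are hypothetical stand-ins for the Spark operations of the same purpose.

```python
from itertools import chain


def union(rdd1, rdd2):
    """Model of union: the output partitions are the parents' partitions, concatenated."""
    return rdd1 + rdd2


def zip_partitions(rdd1, rdd2, f):
    """Model of zipPartitions: apply f to the i-th partition iterators of both parents."""
    if len(rdd1) != len(rdd2):
        raise ValueError("both inputs must have the same number of partitions")
    return [list(f(iter(a), iter(b))) for a, b in zip(rdd1, rdd2)]


rdd1 = [[1, 2], [3]]        # 2 partitions (illustrative data)
rdd2 = [[10], [20, 30]]     # 2 partitions

unioned = union(rdd1, rdd2)                 # 2 + 2 = 4 partitions, so 4 tasks
zipped = zip_partitions(rdd1, rdd2, chain)  # 2 partitions, so 2 tasks

print(len(unioned), len(zipped))
```

Both results hold the same records. The zipped version interleaves them partition by partition, so the record order in the output differs from the union.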

On Wed, Feb 3, 2016 at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:

> I am surprised that union introduces a stage. UnionRDD should have only
> narrow dependencies.
>
> On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>>
>> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> Does anyone know how to union two RDDs without the overhead of a union?
>>>
>>> Say rdd1.union(rdd2).saveAsTextFile(..)
>>> This requires a stage to union the 2 RDDs before saveAsTextFile (2
>>> stages). Is there a way to skip the union step but have the contents of the
>>> two RDDs saved to the same output text file?
>>>
>>> Thank you!
>>>
>>> Jerry
>>>
>>
>>
>

-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam

Hi Spark users and developers,

Does anyone know how to union two RDDs without the overhead of a union?

Say rdd1.union(rdd2).saveAsTextFile(..)
This requires a stage to union the 2 RDDs before saveAsTextFile (2 stages).
Is there a way to skip the union step but have the contents of the two RDDs
saved to the same output text file?

Thank you!

Jerry
