Re: Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Deng Ching-Mallete
Hi,

You need to perform an action (e.g. count, take, saveAs*) in order
for your RDDs to be cached, since cache/persist are lazy. You
might also want to use coalesce instead of repartition to avoid a full shuffle.

Thanks,
Deng

On Mon, Nov 2, 2015 at 5:53 PM, Sushrut Ikhar 
wrote:

> Hi,
> I need to split an RDD into 3 different RDDs using filter transformations.
> I have cached the original RDD before applying the filters.
> The input is lopsided, leaving some executors with a heavy load while
> others have less, so I have repartitioned it.
>
> *DAG-lineage I expected:*
>
> I/P RDD  -->  MAP RDD --> SHUFFLE RDD (repartition) -->
>
> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1 --> UNION RDD --> O/P RDD
>--> FILTER RDD2 --> MAP2
>--> FILTER RDD3 --> MAP3
>
> *DAG-lineage I observed:*
>
> I/P RDD  -->  MAP RDD -->
>
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD2 --> MAP2
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD3 --> MAP3
> -->
>
> UNION RDD --> O/P RDD
>
> Also, the Spark UI shows that no RDD partitions are actually being cached.
>
> How do I split it without shuffling thrice?
> Regards,
>
> Sushrut Ikhar
> about.me/sushrutikhar
>


Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Sushrut Ikhar