Hi,
I need to split an RDD into 3 different RDDs using filter transformations.
I have cached the original RDD before using filter.
The input is skewed, leaving some executors with a heavy load and others
with much less, so I have repartitioned it.
*DAG-lineage I expected:*
I/P RDD --> MAP RDD --> SHUFFLE RDD (repartition) -->
*MAP RDD (cache)* --> FILTER RDD1 --> MAP1 --> UNION RDD --> O/P RDD
                  --> FILTER RDD2 --> MAP2 -->
                  --> FILTER RDD3 --> MAP3 -->
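For reference, a minimal sketch of a job with the expected lineage above. This is an illustration under assumptions, not the actual job: the `Int` records, the modulo predicates, the map functions, and the names `SplitSketch` and `build` are all placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object SplitSketch {
  // Hypothetical pipeline: the record type, predicates and map functions
  // are stand-ins for whatever the real job does.
  def build(input: RDD[Int], numPartitions: Int): (RDD[Int], RDD[String]) = {
    val cached = input
      .map(_ + 1)                          // MAP RDD
      .repartition(numPartitions)          // SHUFFLE RDD (repartition)
      .map(_ * 2)                          // MAP RDD (cache)
      .persist(StorageLevel.MEMORY_ONLY)   // only marks the RDD; nothing is stored until an action runs

    // Three filter branches that all read from the same persisted RDD.
    val out1 = cached.filter(_ % 3 == 0).map(x => s"a:$x")   // FILTER RDD1 --> MAP1
    val out2 = cached.filter(_ % 3 == 1).map(x => s"b:$x")   // FILTER RDD2 --> MAP2
    val out3 = cached.filter(_ % 3 == 2).map(x => s"c:$x")   // FILTER RDD3 --> MAP3

    // UNION RDD --> O/P RDD
    (cached, input.sparkContext.union(Seq(out1, out2, out3)))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (cached, output) = build(sc.parallelize(1 to 100), 8)
      println(output.toDebugString)     // lineage as Spark sees it
      println(output.count())           // action that triggers the job
      println(cached.getStorageLevel)   // storage level requested for the cached RDD
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch all three filters are built on the same `cached` reference, and `toDebugString` prints the lineage so it can be compared with what the UI shows.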
*DAG-lineage I observed:*
I/P RDD --> MAP RDD -->
    SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1 -->
    SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD2 --> MAP2 -->
    SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD3 --> MAP3 -->
UNION RDD --> O/P RDD
Also, the Spark UI shows that no RDD partitions are actually being cached.
How do I split the RDD without shuffling three times?
Regards,
Sushrut Ikhar
<https://about.me/sushrutikhar?promo=email_sig>