Hi,

You need to perform an action (e.g. count, take, saveAs*, etc.) for your RDD to actually be cached, since cache/persist are lazy. You might also want to use coalesce instead of repartition to avoid a shuffle.
Thanks,
Deng

On Mon, Nov 2, 2015 at 5:53 PM, Sushrut Ikhar <sushrutikha...@gmail.com> wrote:
> Hi,
> I need to split an RDD into 3 different RDDs using filter transformations.
> I have cached the original RDD before using filter.
> The input is lopsided, leaving some executors with a heavy load while
> others have less, so I have repartitioned it.
>
> *DAG lineage I expected:*
>
> I/P RDD --> MAP RDD --> SHUFFLE RDD (repartition) -->
>
> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1 --> UNION RDD --> O/P RDD
>                   --> FILTER RDD2 --> MAP2
>                   --> FILTER RDD3 --> MAP3
>
> *DAG lineage I observed:*
>
> I/P RDD --> MAP RDD -->
>
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD1 --> MAP1
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD2 --> MAP2
> SHUFFLE RDD (repartition) --> *MAP RDD (cache)* --> FILTER RDD3 --> MAP3
> -->
>
> UNION RDD --> O/P RDD
>
> Also, the Spark UI shows that no RDD partitions are actually being cached.
>
> How do I split without shuffling thrice?
>
> Regards,
> Sushrut Ikhar
> about.me/sushrutikhar <https://about.me/sushrutikhar?promo=email_sig>