Re: Will multiple filters on the same RDD be optimized to one filter?
Depending on what you do with them, they will get computed separately, because you may have a long DAG in each branch. Spark pipelines the transformations within a branch together rather than trying to optimize across branches.

On Jul 16, 2015 1:40 PM, "Bin Wang" wrote:
> What if I use both rdd1 and rdd2 later?
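Since the branches are computed separately, the usual way to avoid recomputing the shared parent is to cache it. A minimal sketch (not from the thread; it assumes the `input` RDD from the original question and a running Spark application):

    // Sketch: cache the shared parent RDD so each branch reuses the
    // materialized data instead of recomputing the map step.
    val rdd = input.map(_.value).cache() // materialized on the first action
    val f1  = rdd.filter(_ == 1)
    val f2  = rdd.filter(_ == 2)
    f1.count() // first action: runs the map once and populates the cache
    f2.count() // second action: reads cached data; the map is not rerun

Without the `cache()`, each `count()` would trigger the `map` again for its own branch.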
Re: Will multiple filters on the same RDD be optimized to one filter?
What if I use both rdd1 and rdd2 later?

Raghavendra Pandey wrote on Thu, Jul 16, 2015 at 4:08 PM:
> If you cache the rdd, it will save some operations. But filter is a lazy
> operation anyway, and it runs based on what you do later with rdd1 and
> rdd2...
Re: Will multiple filters on the same RDD be optimized to one filter?
If you cache the rdd, it will save some operations. But filter is a lazy operation anyway, and it runs based on what you do later with rdd1 and rdd2...

Raghavendra

On Jul 16, 2015 1:33 PM, "Bin Wang" wrote:
> If I write code like this:
>
>     val rdd = input.map(_.value)
>     val f1 = rdd.filter(_ == 1)
>     val f2 = rdd.filter(_ == 2)
>     ...
>
> Then the DAG of the execution may look like this:
>
>         -> Filter -> ...
>     Map
>         -> Filter -> ...
>
> But the two filters operate on the same RDD, which means the work could be
> done by scanning the RDD once. Does Spark have this kind of optimization
> for now?
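If the goal is truly a single scan, one workaround (a sketch, not something Spark does automatically; it assumes `rdd` as defined in the quoted question and that only the counts are needed) is to fold both predicates into one pass with `RDD.aggregate`:

    // Sketch: count matches for both predicates in a single scan,
    // instead of running two separate filter/count jobs.
    val (ones, twos) = rdd.aggregate((0L, 0L))(
      // seqOp: update both counters per element within a partition
      { case ((c1, c2), v) => (c1 + (if (v == 1) 1L else 0L),
                               c2 + (if (v == 2) 1L else 0L)) },
      // combOp: merge per-partition counter pairs
      { case ((a1, a2), (b1, b2)) => (a1 + b1, a2 + b2) }
    )

This trades the two filtered RDDs for a single job, so it only applies when the downstream use is an aggregation rather than further transformations on rdd1 and rdd2.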