Re: Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Raghavendra Pandey
Depending on what you do with them, they will get computed separately,
because each branch may have a long DAG of its own. Spark pipelines the
transformations within a branch together rather than trying to optimize
across branches.
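
For example (a sketch, not from the original message, assuming input is an
RDD whose elements carry a value field, as in the question below): caching
the shared parent RDD keeps each branch from rescanning the input.

val rdd = input.map(_.value).cache()  // materialized on the first action, reused afterwards

val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)

// Each count() launches its own job, but after the first one the cached
// rdd is read from memory instead of being recomputed from input.
val n1 = f1.count()
val n2 = f2.count()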


Re: Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
What if I use both filtered RDDs (f1 and f2) later?



Re: Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Raghavendra Pandey
If you cache the RDD, it will save some operations. But filter is a lazy
operation anyway, and what actually runs depends on what you do later with
f1 and f2...
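
To illustrate that laziness (a sketch, not part of the original reply):
defining the filters triggers no computation; only an action does.

val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)  // lazy: no job runs here
val f2 = rdd.filter(_ == 2)  // lazy: no job runs here

val n1 = f1.count()  // action: first job runs the map and filter over the data
val n2 = f2.count()  // action: second job recomputes the map unless rdd is cached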

Raghavendra


Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
If I write code like this:

val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
...

Then the execution DAG may look like this:

     +-> Filter (f1) -> ...
Map -+
     +-> Filter (f2) -> ...

But the two filters operate on the same RDD, which means both could be
computed by scanning the RDD just once. Does Spark currently have this kind
of optimization?
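
If a single scan is the goal, it can be done by hand (a sketch, not part of
the original question, assuming only the counts of the two values are
needed): fold both predicates into one pass with aggregate.

val rdd = input.map(_.value)

// One pass over the data produces both counts.
val (ones, twos) = rdd.aggregate((0L, 0L))(
  (acc, v) => (acc._1 + (if (v == 1) 1L else 0L),
               acc._2 + (if (v == 2) 1L else 0L)),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)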