Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread YaoPau
I have an RDD x of millions of strings, each of which I want to pass through a set of filters. My filtering code looks like this:

    x.filter(filter#1, which will filter out 40% of data).
      filter(filter#2, which will filter out 20% of data).
      filter(filter#3, which will filter out 2% of data)

Re: Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread Aaron Davidson
In the situation you show, Spark will pipeline the filters together and apply them one at a time to each row, effectively constructing an && statement. You would only see a performance difference if the filter code itself is somewhat expensive; in that case you would want to execute it only on
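
To make the pipelining point concrete, here is a rough sketch in plain Python rather than Spark. It only imitates the behaviour described above using lazy iterators; it does not use or reproduce Spark's API. The three predicates, the row data, and all helper names (counting, run_chained, run_fused, measure) are made up for illustration, chosen so that the filters drop 40%, 20% and 2% of rows independently of one another.

```python
def counting(name, predicate, calls):
    """Wrap a predicate so every invocation is tallied in calls[name]."""
    def wrapped(row):
        calls[name] = calls.get(name, 0) + 1
        return predicate(row)
    return wrapped


def run_chained(rows, predicates):
    """Chain one lazy filter per predicate, then pull every row through."""
    pipeline = iter(rows)
    for predicate in predicates:
        pipeline = filter(predicate, pipeline)  # lazy: nothing runs yet
    return list(pipeline)


def run_fused(rows, predicates):
    """One pass with short-circuiting all(), i.e. p1(r) and p2(r) and p3(r)."""
    return [row for row in rows if all(p(row) for p in predicates)]


# Hypothetical stand-ins for filter#1..#3; each returns True to KEEP a row.
def filter_1(row):
    return int(row) % 10 >= 4            # drops 40% (last digit 0-3)


def filter_2(row):
    return (int(row) // 10) % 5 != 0     # drops 20% (tens digit 0 or 5)


def filter_3(row):
    return (int(row) // 100) % 50 != 0   # drops 2% (leading part 0 or 50)


def measure(runner, rows, named_predicates):
    """Run the filters in the given order; return survivors and call counts."""
    calls = {}
    wrapped = [counting(name, p, calls) for name, p in named_predicates]
    return runner(rows, wrapped), calls


rows = [str(n) for n in range(10000)]
selective_first = [("f1", filter_1), ("f2", filter_2), ("f3", filter_3)]

kept, calls = measure(run_chained, rows, selective_first)
print(len(kept), calls)

kept_rev, calls_rev = measure(run_chained, rows, selective_first[::-1])
print(len(kept_rev), calls_rev)
```

With this toy data both orderings keep the same 4704 rows. The selective-first order calls the predicates 10000, 6000 and 4800 times, while the reversed order calls them 10000, 9800 and 7840 times. So the order changes how often each predicate runs, which only matters when a predicate is costly. run_fused gives the same counts as run_chained for a given order, which is the sense in which the chain behaves like a single && expression.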