In the situation you show, Spark will pipeline the filters together, applying them one at a time to each row, effectively constructing an "&&" statement. You would only see a performance difference if the filter code itself is somewhat expensive, in which case you would want to execute it on as few rows as possible. Otherwise, the runtime difference of "a == b && b == c && c == d" is minimal compared to "a == b & b == c & c == d"; the latter is essentially the worst-case scenario, since it always evaluates every condition (though as I said, Spark behaves like the former). A short sketch of what this looks like is below.
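For illustration, here is a minimal Scala sketch; the predicate names, selectivities, and input path are hypothetical stand-ins for your four filters, and `sc` is assumed to be your SparkContext:

    // Hypothetical input; assume `sc` is an existing SparkContext.
    val lines: org.apache.spark.rdd.RDD[String] = sc.textFile("hdfs:///path/to/data")

    // Hypothetical predicates standing in for your filter#1..filter#4.
    def filter1(s: String): Boolean = s.nonEmpty          // drops ~40% of rows
    def filter2(s: String): Boolean = s.contains(",")     // drops ~20%
    def filter3(s: String): Boolean = s.length < 1024     // drops ~2%
    def filter4(s: String): Boolean = !s.startsWith("#")  // drops ~1%

    // Chained filters: Spark pipelines these into a single stage, so each
    // row stops at the first predicate it fails; there is no extra pass
    // over the data per filter.
    val chained = lines.filter(filter1).filter(filter2).filter(filter3).filter(filter4)

    // Equivalent single filter with an explicit short-circuiting &&. The
    // ordering here (e.g. most selective or cheapest predicate first) is
    // entirely yours to choose; Spark will not rearrange it.
    val combined = lines.filter(s => filter1(s) && filter2(s) && filter3(s) && filter4(s))

Both forms evaluate the predicates in the order written, so if reordering matters for you (say, an expensive filter you want to run last), just write them in that order.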
Spark does not reorder the filters automatically. It uses the explicit ordering you provide.

On Fri, Nov 14, 2014 at 10:20 AM, YaoPau <jonrgr...@gmail.com> wrote:
> I have an RDD "x" of millions of STRINGs, each of which I want to pass
> through a set of filters. My filtering code looks like this:
>
> x.filter(filter#1, which will filter out 40% of data).
>   filter(filter#2, which will filter out 20% of data).
>   filter(filter#3, which will filter out 2% of data).
>   filter(filter#4, which will filter out 1% of data)
>
> There is no ordering requirement (filter #2 does not depend on filter #1,
> etc), but the filters are drastically different in the % of rows they
> should eliminate. What I'd like is an ordering similar to a "||" statement,
> where if it fails on filter#1 the row automatically gets filtered out
> before the other three filters run.
>
> But when I play around with the ordering of the filters, the runtime
> doesn't seem to change. Is Spark somehow intelligently guessing how
> effective each filter will be and ordering it correctly regardless of how
> I order them? If not, is there a way I can set the filter order?