Re: Catalyst: Reusing already computed expressions within a projection

2015-05-31 Thread Justin Uang
Thanks for pointing to that link! It looks like it’s useful, but it does look more complicated than the case I’m trying to address. In my case, we set y = f(x), then we use y later on in future projections (z = g(y)). In that case, the analysis is trivial in that we aren’t trying to find

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-31 Thread Reynold Xin
I think Michael's bringing up code gen because the compiler (not Spark, but javac and the JVM JIT) already does common subexpression elimination, so we might get it for free during code gen.

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Reynold Xin
I think you are looking for http://en.wikipedia.org/wiki/Common_subexpression_elimination in the optimizer. One thing to note is that as we add more and more optimizations like this, the optimization time might increase. Do you see a case where this can bring you substantial performance gains?
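
As a rough illustration of what common subexpression elimination buys here, the following toy evaluator reuses the result of a repeated subtree instead of recomputing it. The nested-tuple expression format and the `evaluate` helper are assumptions made up for this sketch; they are not Catalyst's expression classes or optimizer rules.

```python
# Toy illustration only; this is not how Catalyst represents expressions.

def evaluate(expr, row, cache, counter):
    """Evaluate a nested-tuple expression, reusing results of repeated subtrees.

    expr is a column name (str), a number, or a tuple ("mul", left, right).
    cache maps already-evaluated subtrees to their values for this row.
    counter is a one-element list counting multiplications actually performed.
    """
    if isinstance(expr, str):
        return row[expr]
    if isinstance(expr, (int, float)):
        return expr
    if expr in cache:
        # Common subexpression: reuse the earlier result.
        return cache[expr]
    op, left, right = expr
    if op != "mul":
        raise ValueError("unsupported operator: %r" % (op,))
    value = evaluate(left, row, cache, counter) * evaluate(right, row, cache, counter)
    counter[0] += 1
    cache[expr] = value
    return value


# y = x * 7 and z = y * 3, mirroring the projection discussed in this thread.
y = ("mul", "x", 7.0)
z = ("mul", y, 3.0)

row = {"x": 2.0}
cache, counter = {}, [0]
results = [evaluate(e, row, cache, counter) for e in (y, z)]
print(results, counter[0])  # [14.0, 42.0] 2
```

With the cache, `x * 7` is multiplied once per row even though it appears in both output expressions; without it, the same row would need three multiplications.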

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Michael Armbrust
I think this is likely something that we'll want to do during the code generation phase, though it's probably not the lowest-hanging fruit at this point.

Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
If I do the following:

    df2 = df.withColumn('y', df['x'] * 7)
    df3 = df2.withColumn('z', df2.y * 3)
    df3.explain()

Then the result is:

    Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
     PhysicalRDD

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
On second thought, perhaps this could be done by writing a rule that builds the DAG of dependencies between expressions and then converts it into several layers of projections, where each new layer is allowed to depend on expression results from previous projections? Are there any pitfalls to this
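
A minimal sketch of the layering idea, assuming each output expression is described only by the set of names it reads. The function `layer_projections` and its dictionary input are hypothetical and exist only for illustration; they are not part of Catalyst or PySpark.

```python
# Hypothetical sketch: names and structure are illustrative, not Catalyst APIs.

def layer_projections(exprs, inputs):
    """Assign each named expression to a projection layer.

    exprs:  dict mapping an output name to the set of names it reads.
    inputs: set of column names available before any projection.
    Returns a list of layers; each layer is a sorted list of names whose
    dependencies are all produced by the inputs or by earlier layers.
    """
    available = set(inputs)
    remaining = dict(exprs)
    layers = []
    while remaining:
        # An expression is ready once everything it reads has been produced.
        ready = sorted(name for name, deps in remaining.items() if deps <= available)
        if not ready:
            raise ValueError(
                "cyclic or unresolved dependencies: %s" % sorted(remaining)
            )
        layers.append(ready)
        available.update(ready)
        for name in ready:
            del remaining[name]
    return layers


# y = x * 7 and z = y * 3, as in the example that started this thread.
layers = layer_projections({"y": {"x"}, "z": {"y"}}, inputs={"x"})
print(layers)  # [['y'], ['z']]
```

Each inner list would correspond to one projection: the first computes `y` from `x`, and the second computes `z` from the already materialized `y` rather than from a second copy of `x * 7`.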