Xiangrui Meng created SPARK-10371:
-------------------------------------

             Summary: Optimize sequential projections
                 Key: SPARK-10371
                 URL: https://issues.apache.org/jira/browse/SPARK-10371
             Project: Spark
          Issue Type: New Feature
          Components: ML, SQL
    Affects Versions: 1.5.0
            Reporter: Xiangrui Meng

In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with the columns a, b, c, d, where a is from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, udf_b and udf_c are each triggered twice: once to compute c, and again inside the expression for d. In other words, the value of c is not re-used when computing d. It would be nice to detect this pattern and re-use intermediate values.
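
The double evaluation described above can be illustrated with a small sketch in plain Python. It is only a model of the behavior, not Spark code: the UDF bodies are arbitrary arithmetic, and `project_collapsed`, `project_reusing`, and `count_calls` are hypothetical names introduced here for illustration. The assumption is that the chained projections are collapsed so that each output column becomes a full expression over a.

```python
from collections import Counter

# Call counter standing in for "how often each UDF is triggered".
calls = Counter()


def udf_b(a):
    calls["udf_b"] += 1
    return a + 1  # arbitrary placeholder body


def udf_c(b):
    calls["udf_c"] += 1
    return b * 2  # arbitrary placeholder body


def udf_d(c):
    calls["udf_d"] += 1
    return c - 3  # arbitrary placeholder body


def project_collapsed(a):
    """Model of the reported behavior: c and d are each evaluated as a
    full expression over a, so udf_b and udf_c run once per output column."""
    c = udf_c(udf_b(a))
    d = udf_d(udf_c(udf_b(a)))
    return c, d


def project_reusing(a):
    """Model of the requested behavior: each intermediate value is
    computed once and re-used by the next projection."""
    b = udf_b(a)
    c = udf_c(b)
    d = udf_d(c)
    return c, d


def count_calls(project, a):
    """Run one projection on a single input value and report the result
    together with the number of calls made to each UDF."""
    calls.clear()
    result = project(a)
    return result, dict(calls)


if __name__ == "__main__":
    print(count_calls(project_collapsed, 1))
    print(count_calls(project_reusing, 1))
```

Both variants return the same values for c and d; in this model only the number of UDF calls per row differs, which is the cost the issue proposes to eliminate.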