David Vogelbacher created SPARK-24983: -----------------------------------------
Summary: Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver Key: SPARK-24983 URL: https://issues.apache.org/jira/browse/SPARK-24983 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.3.1 Reporter: David Vogelbacher Hi, I noticed that writing a spark job that includes many sequential when-otherwise statements on the same column can easily OOM the driver while generating the optimized plan because the project node will grow exponentially in size. Example: {noformat} scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> val df = Seq("a", "b", "c", "1").toDF("text") df: org.apache.spark.sql.DataFrame = [text: string] scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string] scala> for( a <- 1 to 5) { | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text")) | } scala> dfCaseWhen.queryExecution.analyzed res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] +- Filter NOT (text#3 = 0) +- Project [value#1 AS text#3] +- LocalRelation [value#1] scala> dfCaseWhen.queryExecution.optimizedPlan res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... {noformat} As one can see the optimized plan grows exponentially in the number of {{when-otherwise}} statements here. I can see that this comes from the {{CollapseProject}} optimizer rule. Maybe we should put a limit on the resulting size of the project node after collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org