JacobZheng created SPARK-46071:
----------------------------------

             Summary: TreeNode.toJSON may result in OOM when there are multiple levels of nesting of expressions.
                 Key: SPARK-46071
                 URL: https://issues.apache.org/jira/browse/SPARK-46071
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: JacobZheng

I am encountering an OOM exception when executing the following code:

{code:scala}
parser.parseExpression(sql).toJSON
{code}

The SQL is a deeply nested *_CaseWhen_* expression. After testing, I found that the number of expressions in the JSON increases exponentially with the nesting depth. Here is an example.

sql:
{code:sql}
CASE WHEN(`cost` <= 275) THEN '(270-275]' ELSE '----' END
{code}

json:
{code:json}
[
  {
    "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
    "num-children":3,
    "branches":[
      {
        "product-class":"scala.Tuple2",
        "_1":[
          {
            "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
            "num-children":2,
            "left":0,
            "right":1
          },
          {
            "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
            "num-children":0,
            "nameParts":"[cost]"
          },
          {
            "class":"org.apache.spark.sql.catalyst.expressions.Literal",
            "num-children":0,
            "value":"275",
            "dataType":"integer"
          }
        ],
        "_2":[
          {
            "class":"org.apache.spark.sql.catalyst.expressions.Literal",
            "num-children":0,
            "value":"(270-275]",
            "dataType":"string"
          }
        ]
      }
    ],
    "elseValue":[
      {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"----",
        "dataType":"string"
      }
    ]
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
    "num-children":2,
    "left":0,
    "right":1
  },
  {
    "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
    "num-children":0,
    "nameParts":"[cost]"
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children":0,
    "value":"275",
    "dataType":"integer"
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children":0,
    "value":"(270-275]",
    "dataType":"string"
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children":0,
    "value":"----",
    "dataType":"string"
  }
]
{code}

The child nodes of the *_CaseWhen_* expression are stored twice in the JSON: once inline in the {{branches}} and {{elseValue}} fields, and once as entries in the flattened node list. When *_CaseWhen_* is nested twice, the child expressions of the first *_CaseWhen_* (the inner one) are repeated 4 times, and so on.

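The doubling described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark code: {{Expr}}, {{Leaf}}, {{Cmp}}, {{Case}}, {{objectCount}} and {{nested}} are hypothetical names, and the model simply assumes that a {{Case}} node writes each child once inline and once in the flattened list, while every other node refers to its children by index.

```scala
// Toy model (hypothetical names, not Spark's classes) of the duplication
// described above.
object ToJsonGrowth {
  sealed trait Expr { def children: Seq[Expr] }
  final case class Leaf(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  // Stands in for a comparison such as `cost` <= 275.
  final case class Cmp(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  // Stands in for CASE WHEN cond THEN value ELSE orElse END.
  final case class Case(cond: Expr, value: Expr, orElse: Expr) extends Expr {
    val children: Seq[Expr] = Seq(cond, value, orElse)
  }

  // Number of expression objects written for `e` under the model's assumption:
  // one for the node itself, one full subtree per child in the flattened list,
  // and, for Case only, a second inline copy of every child subtree.
  def objectCount(e: Expr): BigInt = {
    val fromChildren: BigInt = e.children.map(objectCount).sum
    e match {
      case _: Case => BigInt(1) + fromChildren * BigInt(2)
      case _       => BigInt(1) + fromChildren
    }
  }

  // Builds a Case nested `n` levels deep through its ELSE branch.
  def nested(n: Int): Expr =
    if (n == 0) Leaf("'----'")
    else Case(Cmp(Leaf("cost"), Leaf(n.toString)), Leaf("label"), nested(n - 1))

  def main(args: Array[String]): Unit =
    (1 to 5).foreach { n =>
      println(s"depth $n: ${objectCount(nested(n))} expression objects")
    }
}
```

At depth 1 the model counts 11 expression objects, which is the number of objects carrying a "class" field in the JSON example above. The counts roughly double with each additional level; the exact constants are a property of this toy model and should not be read as measured Spark figures.
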
For example, with two levels of nesting:

{code:sql}
CASE WHEN(`cost` <= 270) THEN '(265-270]' ELSE CASE WHEN(`cost` <= 275) THEN '(270-275]' ELSE '----' END END
{code}

Nesting the *_CaseWhen_* expression n times in this way results in 2^n+11 expressions in the JSON.

The reason is that the fields of *_CaseWhen_* cannot be converted to child indices when the *_jsonFields_* method executes, so each child is serialized inline in full. Simplifying the *_CaseWhen_* JSON by overriding the *_jsonFields_* method may be a viable fix:

{code:json}
[
  {
    "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
    "num-children":3,
    "branches":[
      {
        "condition":0,
        "value":1
      }
    ],
    "elseValue":2
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
    "num-children":2,
    "left":0,
    "right":1
  },
  {
    "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
    "num-children":0,
    "nameParts":"[cost]"
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children":0,
    "value":"275",
    "dataType":"integer"
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children":0,
    "value":"(270-275]",
    "dataType":"string"
  },
  {
    "class":"org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children":0,
    "value":"----",
    "dataType":"string"
  }
]
{code}
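
The index-based layout proposed in this issue can be sketched with a small stand-alone model. It is an illustration only and does not use Spark's {{TreeNode}} API: {{Expr}}, {{Leaf}}, {{Case}}, {{jsonFields}}, {{render}}, {{toJsonNodes}} and {{nested}} are hypothetical names, and the model assumes a single WHEN branch, so the condition, value and else children sit at indices 0, 1 and 2.

```scala
// Toy model of the proposed layout (hypothetical names, not Spark's classes):
// a Case node refers to its children by index instead of inlining them.
object IndexedCaseWhen {
  sealed trait Expr {
    def label: String
    def children: Seq[Expr]
  }
  final case class Leaf(text: String) extends Expr {
    val label = "Leaf"
    val children: Seq[Expr] = Nil
  }
  final case class Case(cond: Expr, value: Expr, orElse: Expr) extends Expr {
    val label = "Case"
    val children: Seq[Expr] = Seq(cond, value, orElse)
  }

  // Extra JSON fields of one node, as (key, raw JSON value) pairs.
  // For Case the values are positions in `children`, never nested objects.
  def jsonFields(e: Expr): Seq[(String, String)] = e match {
    case Leaf(text) => Seq("text" -> ("\"" + text + "\""))
    case _: Case =>
      Seq("branches" -> """[{"condition":0,"value":1}]""", "elseValue" -> "2")
  }

  // Renders a single node as one JSON object (no escaping; toy input only).
  def render(e: Expr): String = {
    val extra = jsonFields(e).map { case (k, v) => s""","$k":$v""" }.mkString
    s"""{"class":"${e.label}","num-children":${e.children.length}$extra}"""
  }

  // Pre-order flattening: every node is emitted exactly once.
  def toJsonNodes(e: Expr): Vector[String] =
    render(e) +: e.children.toVector.flatMap(toJsonNodes)

  // Builds a Case nested `n` levels deep through its ELSE branch.
  def nested(n: Int): Expr =
    if (n == 0) Leaf("----")
    else Case(Leaf("cost <= " + n), Leaf("bucket " + n), nested(n - 1))

  def main(args: Array[String]): Unit = {
    toJsonNodes(nested(2)).foreach(println)
    println(s"depth 20: ${toJsonNodes(nested(20)).length} objects")
  }
}
```

Because no subtree is written twice, the number of objects in this model grows linearly with the nesting depth (three more per level) rather than doubling.
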