[ https://issues.apache.org/jira/browse/SPARK-46071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-46071:
-----------------------------------
    Labels: pull-request-available  (was: )

> TreeNode.toJSON may result in OOM when there are multiple levels of nesting of expressions.
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-46071
>                 URL: https://issues.apache.org/jira/browse/SPARK-46071
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: JacobZheng
>            Priority: Major
>              Labels: pull-request-available
>
> I am encountering an OOM exception when executing the following code:
> {code:scala}
> parser.parseExpression(sql).toJSON
> {code}
> The SQL is a deeply nested *_CaseWhen_* expression. Testing shows that the
> number of expressions in the JSON grows exponentially with the nesting depth.
> Here is an example.
> sql:
> {code:sql}
> CASE WHEN(`cost` <= 275) THEN '(270-275]'
> ELSE '----' END
> {code}
> json:
> {code:json}
> [
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
>     "num-children":3,
>     "branches":[
>       {
>         "product-class":"scala.Tuple2",
>         "_1":[
>           {
>             "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
>             "num-children":2,
>             "left":0,
>             "right":1
>           },
>           {
>             "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
>             "num-children":0,
>             "nameParts":"[cost]"
>           },
>           {
>             "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>             "num-children":0,
>             "value":"275",
>             "dataType":"integer"
>           }
>         ],
>         "_2":[
>           {
>             "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>             "num-children":0,
>             "value":"(270-275]",
>             "dataType":"string"
>           }
>         ]
>       }
>     ],
>     "elseValue":[
>       {
>         "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>         "num-children":0,
>         "value":"----",
>         "dataType":"string"
>       }
>     ]
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
>     "num-children":2,
>     "left":0,
>     "right":1
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
>     "num-children":0,
>     "nameParts":"[cost]"
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>     "num-children":0,
>     "value":"275",
>     "dataType":"integer"
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>     "num-children":0,
>     "value":"(270-275]",
>     "dataType":"string"
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>     "num-children":0,
>     "value":"----",
>     "dataType":"string"
>   }
> ]
> {code}
> The child nodes of the *_CaseWhen_* expression are stored twice in the JSON:
> once inline in the "branches" and "elseValue" fields, and once as entries in
> the flattened node list. When *_CaseWhen_* is nested twice, the child
> expressions of the innermost *_CaseWhen_* are repeated 4 times, and so on.
> {code:sql}
> CASE WHEN(`cost` <= 270) THEN '(265-270]'
> ELSE
> CASE WHEN(`cost` <= 275) THEN '(270-275]'
> ELSE '----' END END
> {code}
> Nesting the *_CaseWhen_* expression n times in this case will result in
> 2^n+11 expressions in the JSON.
> The cause is that the fields of *_CaseWhen_* cannot be converted to child
> indexes when the *_jsonFields_* method runs, so the child expressions are
> serialized inline.
> Simplifying the *_CaseWhen_* JSON by overriding the *_jsonFields_* method
> may be a viable fix, for example:
> {code:json}
> [
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
>     "num-children":3,
>     "branches":[
>       {
>         "condition":0,
>         "value":1
>       }
>     ],
>     "elseValue":2
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
>     "num-children":2,
>     "left":0,
>     "right":1
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
>     "num-children":0,
>     "nameParts":"[cost]"
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>     "num-children":0,
>     "value":"275",
>     "dataType":"integer"
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>     "num-children":0,
>     "value":"(270-275]",
>     "dataType":"string"
>   },
>   {
>     "class":"org.apache.spark.sql.catalyst.expressions.Literal",
>     "num-children":0,
>     "value":"----",
>     "dataType":"string"
>   }
> ]
> {code}
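The growth described in the report can be checked with a small toy model. This is a minimal sketch, not Spark's implementation: Expr, Leaf, Binary, CaseWhen and JsonSizeModel are hypothetical stand-ins, and the model assumes that each CaseWhen field embeds the complete JSON of its subtree while every other node refers to its children by index, as the first JSON listing suggests. It compares the number of JSON objects produced by that inline form with the number produced by the index-based form proposed in the report.

```scala
// Toy model only: hypothetical types, no dependency on Spark.
sealed trait Expr { def children: Seq[Expr] }
case class Leaf(name: String) extends Expr {
  def children: Seq[Expr] = Nil
}
case class Binary(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}
case class CaseWhen(cond: Expr, value: Expr, orElse: Expr) extends Expr {
  def children: Seq[Expr] = Seq(cond, value, orElse)
}

object JsonSizeModel {
  // Pre-order flattening of the tree, like the top-level node list in the JSON.
  def flatten(e: Expr): Seq[Expr] = e +: e.children.flatMap(flatten)

  // Assumed inline form: one object per flattened node, plus, for every
  // CaseWhen node, the full JSON of each subtree embedded in its fields.
  def inlineCount(e: Expr): Long =
    flatten(e).map {
      case CaseWhen(c, v, o) => 1L + inlineCount(c) + inlineCount(v) + inlineCount(o)
      case _                 => 1L
    }.sum

  // Index-based form: fields hold child indexes, so only the flat list remains.
  def indexedCount(e: Expr): Long = flatten(e).size.toLong

  // CASE WHEN cost <= x THEN 'label' ELSE <nested> END, nested `depth` times.
  def nested(depth: Int): Expr =
    (1 to depth).foldLeft(Leaf("'----'"): Expr) { (orElse, _) =>
      CaseWhen(Binary(Leaf("cost"), Leaf("275")), Leaf("'(270-275]'"), orElse)
    }

  def main(args: Array[String]): Unit =
    for (n <- 1 to 5) {
      val e = nested(n)
      println(s"depth=$n inline=${inlineCount(e)} indexed=${indexedCount(e)}")
    }
}
```

Under these assumptions a single level yields 11 objects, which matches the 11 objects in the first JSON listing, and depth 2 yields 31, so each extra level roughly doubles the total. The index-based form grows by only 5 nodes per level. The exact constants depend on the assumed shape of each branch.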