JacobZheng created SPARK-46071:
----------------------------------

             Summary: TreeNode.toJSON may result in OOM when expressions are nested multiple levels deep.
                 Key: SPARK-46071
                 URL: https://issues.apache.org/jira/browse/SPARK-46071
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: JacobZheng


I am encountering an OOM exception when executing the following code:
{code:scala}
    parser.parseExpression(sql).toJSON
{code}
This SQL is composed of multiple nested {*}_CaseWhen_{*} expressions. After testing, I found that the number of expressions in the JSON grows exponentially as the nesting depth increases.

Here is an example:

sql:
{code:sql}
CASE WHEN(`cost` <= 275) THEN '(270-275]' 
ELSE '----' END
{code}
json:
{code:json}
[
    {
        "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
        "num-children":3,
        "branches":[
            {
                "product-class":"scala.Tuple2",
                "_1":[
                    {
                        "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
                        "num-children":2,
                        "left":0,
                        "right":1
                    },
                    {
                        "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
                        "num-children":0,
                        "nameParts":"[cost]"
                    },
                    {
                        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
                        "num-children":0,
                        "value":"275",
                        "dataType":"integer"
                    }
                ],
                "_2":[
                    {
                        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
                        "num-children":0,
                        "value":"(270-275]",
                        "dataType":"string"
                    }
                ]
            }
        ],
        "elseValue":[
            {
                "class":"org.apache.spark.sql.catalyst.expressions.Literal",
                "num-children":0,
                "value":"----",
                "dataType":"string"
            }
        ]
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
        "num-children":2,
        "left":0,
        "right":1
    },
    {
        "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
        "num-children":0,
        "nameParts":"[cost]"
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"275",
        "dataType":"integer"
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"(270-275]",
        "dataType":"string"
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"----",
        "dataType":"string"
    }
]
{code}
The child nodes of the *_CaseWhen_* expression are stored twice in the JSON.

When *_CaseWhen_* is nested twice, the child expression of the first *_CaseWhen_* is repeated 4 times, and so on.
{code:sql}
CASE WHEN(`cost` <= 270) THEN '(265-270]'
ELSE 
    CASE WHEN(`cost` <= 275) THEN '(270-275]' 
    ELSE '----' END END
{code}
Nesting the *_CaseWhen_* expression n times in this way results in 2^n+11 expressions in the JSON.
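
For reference, here is a minimal sketch that shows the growth without actually triggering the OOM. The nestedCaseWhen helper, the chosen depths, and the direct use of CatalystSqlParser are my own additions for illustration:
{code:scala}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Build a CASE WHEN nested `depth` times in the ELSE branch,
// mirroring the example above.
def nestedCaseWhen(depth: Int): String =
  if (depth == 0) "'----'"
  else s"CASE WHEN(`cost` <= ${270 + depth * 5}) THEN 'x' ELSE ${nestedCaseWhen(depth - 1)} END"

// Count how many expression objects end up in the JSON; the count
// roughly doubles with each extra level of nesting.
(1 to 10).foreach { depth =>
  val json = CatalystSqlParser.parseExpression(nestedCaseWhen(depth)).toJSON
  val numExpressions = "\"class\"".r.findAllIn(json).size
  println(s"depth=$depth expressions=$numExpressions jsonLength=${json.length}")
}
{code}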

The reason for this problem is that the fields of *_CaseWhen_* cannot be converted into child-node indices when the *_jsonFields_* method is executed, so each branch expression is serialized in full in addition to appearing in the children list.

Perhaps simplifying the *_CaseWhen_* JSON by overriding the *_jsonFields_* method is a viable way to go, producing something like:
{code:json}
[
    {
        "class":"org.apache.spark.sql.catalyst.expressions.CaseWhen",
        "num-children":3,
        "branches":[
            {
                "condition":0,
                "value":1
            }
        ],
        "elseValue":2
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.LessThanOrEqual",
        "num-children":2,
        "left":0,
        "right":1
    },
    {
        "class":"org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute",
        "num-children":0,
        "nameParts":"[cost]"
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"275",
        "dataType":"integer"
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"(270-275]",
        "dataType":"string"
    },
    {
        "class":"org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children":0,
        "value":"----",
        "dataType":"string"
    }
]
{code}
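
A rough, untested sketch of what that index-based encoding could look like. It is written as a standalone helper rather than the actual override inside *_CaseWhen_*, and the name compactCaseWhenFields is made up for illustration; it assumes the same json4s types that TreeNode already uses for toJSON:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.CaseWhen
import org.json4s.JsonAST._

// Hypothetical helper illustrating the proposed encoding: reference each
// branch condition/value (and elseValue) by its index in `children`
// instead of serializing the whole subtree again.
def compactCaseWhenFields(cw: CaseWhen): List[JField] = {
  val children = cw.children
  List(
    "branches" -> JArray(cw.branches.toList.map { case (cond, value) =>
      JObject(
        "condition" -> JInt(children.indexOf(cond)),
        "value" -> JInt(children.indexOf(value)))
    }),
    // elseValue is optional, so fall back to JNull when it is absent
    "elseValue" -> cw.elseValue
      .map(e => JInt(children.indexOf(e)): JValue)
      .getOrElse(JNull))
}
{code}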
 


