This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 5cc8b39  [SPARK-37784][SQL] Correctly handle UDTs in 
CodeGenerator.addBufferedState()
5cc8b39 is described below

commit 5cc8b397f1b01b5ee4a26e8e8540baf1e05c97a0
Author: Josh Rosen <joshro...@databricks.com>
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

    [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
    
    ### What changes were proposed in this pull request?
    
    This PR fixes a correctness issue in the CodeGenerator.addBufferedState() 
helper method (which is used by the SortMergeJoinExec operator).
    
    The addBufferedState() method generates code for buffering values that come 
from a row in an operator's input iterator, performing any necessary copying so 
that the buffered values remain correct after the input iterator advances to 
the next row.
    
    The current logic does not correctly handle UDTs: these fall through to the 
match statement's default branch, causing UDT values to be buffered without 
copying. This is problematic if the UDT's underlying SQL type is an array, map, 
struct, or string type (since those types require copying). Failing to copy 
values can lead to correctness issues or crashes.
    
    This patch's fix is simple: when the dataType is a UDT, use its underlying 
sqlType for determining whether values need to be copied. I used an existing 
helper function to perform this type unwrapping.
    
    ### Why are the changes needed?
    
    Fix a correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I manually tested this change by re-running a workload which failed with a 
segfault prior to this patch. See JIRA for more details: 
https://issues.apache.org/jira/browse/SPARK-37784
    
    So far I have been unable to come up with a CI-runnable regression test 
which would have failed prior to this change (my only working reproduction runs 
in a pre-production environment and does not fail in my development 
environment).
    
    Closes #35066 from JoshRosen/SPARK-37784.
    
    Authored-by: Josh Rosen <joshro...@databricks.com>
    Signed-off-by: Josh Rosen <joshro...@databricks.com>
    (cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 6e6b946..4092436 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -334,7 +334,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: 
String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = 
$initCode.copy();"
       case _ => code"$value = $initCode;"

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to