[jira] [Created] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

Robert Joseph Evans (Jira) Sat, 29 May 2021 08:54:04 -0700

Robert Joseph Evans created SPARK-35563:
-------------------------------------------


             Summary: [SQL] Window operations with over Int.MaxValue + 1 rows 
can silently drop rows
                 Key: SPARK-35563
                 URL: https://issues.apache.org/jira/browse/SPARK-35563
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.2
            Reporter: Robert Joseph Evans


I think this impacts many more versions of Spark, but I don't know for sure 
because it takes a long time to test. While doing corner-case validation 
testing for spark rapids, I found that if a window function has more than 
{{Int.MaxValue + 1}} rows, the result is silently truncated to that many rows. I 
have only tested this on 3.0.2 with {{row_number}}, but I suspect it will 
impact other window functions as well. This is a really rare corner case, but 
because it is silent data corruption I personally think it is quite serious.
{code:scala}
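import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Single window partition (a = 1) holding every row, ordered by b.
val windowSpec = Window.partitionBy("a").orderBy("b")

// Int.MaxValue + 100 rows, i.e. more than Int.MaxValue + 1.
val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")

// Count the output rows, split by whether row_number came back negative.
spark.time(
  df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn"))
    .orderBy(desc("a"), desc("b"))
    .select((col("rn") < 0).alias("dir"))
    .groupBy("dir").count.show(20))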

+-----+----------+
|  dir|     count|
+-----+----------+
|false|2147483647|
| true|         1|
+-----+----------+

Time taken: 1139089 ms

Int.MaxValue.toLong + 100
res15: Long = 2147483747

2147483647L + 1
res16: Long = 2147483648
{code}
I had to run the above with at least 64GiB of heap for the executor. I ran it 
in local mode, which worked but took a very long time.
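
The counts in the output above can be checked with plain arithmetic. The 
following is a minimal sketch in plain Scala (no Spark session needed); the 
values are copied from the output above and the variable names are only 
illustrative. The last line shows 32-bit wraparound, which is one possible 
explanation for the single row with a negative {{rn}}; that explanation is an 
assumption and has not been confirmed against the Spark source.
{code:scala}
// Values copied from the output shown earlier in this report.
val inputRows: Long   = Int.MaxValue.toLong + 100 // rows produced by spark.range
val falseCount: Long  = 2147483647L               // rows reported with rn >= 0
val trueCount: Long   = 1L                        // rows reported with rn < 0
val outputRows: Long  = falseCount + trueCount    // 2147483648 == Int.MaxValue + 1
val droppedRows: Long = inputRows - outputRows    // 99 rows are missing

// Assumption: a 32-bit counter wraps; Int.MaxValue + 1 is Int.MinValue.
val wrapped: Int = Int.MaxValue + 1               // -2147483648

println(s"input=$inputRows output=$outputRows dropped=$droppedRows wrapped=$wrapped")
{code}
So 99 of the 2147483747 input rows are missing from the result.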



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
