Robert Joseph Evans created SPARK-35563: -------------------------------------------
Summary: [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows Key: SPARK-35563 URL: https://issues.apache.org/jira/browse/SPARK-35563 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Reporter: Robert Joseph Evans I think this impacts a lot more versions of Spark, but I don't know for sure because it takes a long time to test. As a part of doing corner case validation testing for spark rapids I found that if a window function has more than {{Int.MaxValue + 1}} rows the result is silently truncated to that many rows. I have only tested this on 3.0.2 with {{row_number}}, but I suspect it will impact others as well. This is a really rare corner case, but because it is silent data corruption I personally think it is quite serious. {code:scala} import org.apache.spark.sql.expressions.Window val windowSpec = Window.partitionBy("a").orderBy("b") val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as b") spark.time(df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20)) +-----+----------+ | dir| count| +-----+----------+ |false|2147483647| | true| 1| +-----+----------+ Time taken: 1139089 ms Int.MaxValue.toLong + 100 res15: Long = 2147483747 2147483647L + 1 res16: Long = 2147483648 {code} I had to make sure that I ran the above with at least 64GiB of heap for the executor (I did it in local mode and it worked, but took forever to run) -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org