Robert Joseph Evans created SPARK-32672:
-------------------------------------------

             Summary: Data corruption in some cached compressed boolean columns
                 Key: SPARK-32672
                 URL: https://issues.apache.org/jira/browse/SPARK-32672
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Robert Joseph Evans
         Attachments: bad_order.snappy.parquet

I found that when certain boolean data is sorted and then cached, the values 
can change when the data is read back out.

It takes a non-trivial amount of data to trigger, and it is highly dependent on 
the order of the data.  If I disable compression in the cache, the issue goes 
away.  I was able to reproduce this on 3.0.0, and I am going to try to 
reproduce it on other versions too.
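As a sketch of the workaround mentioned above (assuming it is set before the first {{cache()}} call, since cached plans are built with the config in effect at caching time), the in-memory columnar compression can be turned off via an existing Spark SQL config:

{code}
// Workaround sketch: disable compression for the in-memory columnar cache
// so the suspect boolean compression path is not exercised.
// Must be set before the DataFrame is cached.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order.cache()
// With compression disabled, the counts before and after caching should match.
{code}

This only avoids the symptom; the underlying bug is in the compressed boolean column encoding.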

I'll attach the parquet file with boolean data in an order that causes this to 
happen. As you can see, after the data is cached a single null value switches 
over to false.

{code}
scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order: org.apache.spark.sql.DataFrame = [b: boolean]                        

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null| 7153|
| true|54334|
|false|54021|
+-----+-----+


scala> bad_order.cache()
res1: bad_order.type = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null| 7152|
| true|54334|
|false|54022|
+-----+-----+


scala> 

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
