Mukul Murthy created SPARK-24438:
------------------------------------

             Summary: Empty strings and null strings are written to the same partition
                 Key: SPARK-24438
                 URL: https://issues.apache.org/jira/browse/SPARK-24438
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Mukul Murthy


When you partition on a string column that contains both empty strings and
nulls, both are written to the same default partition. When you read the data
back, all of those values come back as null, so the original empty strings
are lost.


{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
display(df)
// =>
// a b
// 1
// 2
// 3
// 4 hello
// 5 null

df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
val df2 = spark.read.load("/home/mukul/weird_test_data4")
display(df2)
// =>
// a b
// 4 hello
// 3 null
// 2 null
// 1 null
// 5 null
{code}

This seems to affect multiple types of tables.
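
The collapse described above can be illustrated with a small standalone model. This is only a sketch, not Spark's actual code: the object and helper names (PartitionPathModel, toPartitionDir, fromPartitionDir) are hypothetical, and it assumes that both null and empty values are mapped to the default partition directory name __HIVE_DEFAULT_PARTITION__, which is then read back as null.

{code:scala}
// Simplified model only; this is NOT Spark's source code.
object PartitionPathModel {
  // Assumed name of the default partition directory.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: map a partition value to a directory name.
  // Null and empty values both fall back to the default name.
  def toPartitionDir(column: String, value: String): String = {
    val dirValue =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$column=$dirValue"
  }

  // Hypothetical helper: recover the value from a directory name on read.
  // The default name can only be turned back into null.
  def fromPartitionDir(dir: String): String = {
    val raw = dir.substring(dir.indexOf('=') + 1)
    if (raw == DefaultPartitionName) null else raw
  }

  def main(args: Array[String]): Unit = {
    for (v <- Seq("", null, "hello")) {
      val dir = toPartitionDir("b", v)
      println(s"written as $dir, read back as ${fromPartitionDir(dir)}")
    }
  }
}
{code}

Under this model the empty string and null share one directory, so the distinction cannot be recovered from the directory name alone when reading.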



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

