Andrew Gross created SPARK-21768:
------------------------------------

             Summary: spark.csv.read Empty String Parsed as NULL when nullValue is Set
                 Key: SPARK-21768
                 URL: https://issues.apache.org/jira/browse/SPARK-21768
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.2.0, 2.0.2
         Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
PySpark

            Reporter: Andrew Gross


In a CSV file with quoted fields, a quoted empty string ("") is read as 
NULL even when nullValue is explicitly set to a different token:

Example CSV with quoted fields, delimiter | and nullValue XXNULLXX:

{{"XXNULLXX"|""|"XXNULLXX"|"foo"}}
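
To reproduce without S3 access, the sample row above can be generated locally. The following is a minimal sketch using only Python's standard csv module; the temporary path and the helper name write_demo_csv are hypothetical and not part of the original report.

```python
import csv
import os
import tempfile

NULL_TOKEN = "XXNULLXX"  # the nullValue used in this report


def write_demo_csv(path):
    """Write the single sample row, quoting every field and using '|' as delimiter."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|", quoting=csv.QUOTE_ALL)
        # Second field is a real empty string, not the null token.
        writer.writerow([NULL_TOKEN, "", NULL_TOKEN, "foo"])


# Hypothetical local path standing in for s3://mybucket/test/demo.csv
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_demo_csv(demo_path)

with open(demo_path, newline="") as f:
    raw_line = f.read().strip()
print(raw_line)
```

The resulting file could then be passed to the load script in place of the S3 key, assuming the Spark session can read the local filesystem.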

PySpark Script to load the file (from S3):

{code:title=load.py|borderStyle=solid}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("test_csv").getOrCreate()

fields = []
fields.append(StructField("First Null Field", StringType(), True))
fields.append(StructField("Empty String Field", StringType(), True))
fields.append(StructField("Second Null Field", StringType(), True))
fields.append(StructField("Non Empty String Field", StringType(), True))
schema = StructType(fields)

keys = ['s3://mybucket/test/demo.csv']

bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss", 
mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
bad_data.show()
{code}

Actual Output:
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|              null|             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}

Expected Output:
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|                  |             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}
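
The expected semantics can be illustrated outside Spark. This is only a sketch with Python's standard csv module, not Spark's CSV parser; the function name parse_with_null_value is hypothetical. It maps only the configured null token to None and leaves a quoted empty string as an empty string, which matches the expected output above.

```python
import csv
import io


def parse_with_null_value(text, sep="|", null_value="XXNULLXX"):
    """Parse delimited text, turning only fields equal to null_value into None."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=sep):
        # An empty string is kept as '' because it does not equal null_value.
        rows.append([None if field == null_value else field for field in record])
    return rows


rows = parse_with_null_value('"XXNULLXX"|""|"XXNULLXX"|"foo"\n')
print(rows)
```

Here the second field stays distinct from the two null fields, which is the distinction the Spark reader loses in the actual output.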


