[ 
https://issues.apache.org/jira/browse/SPARK-21768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132191#comment-16132191
 ] 

Marco Gaido commented on SPARK-21768:
-------------------------------------

This is a duplicate of SPARK-17916.

> spark.csv.read Empty String Parsed as NULL when nullValue is Set
> ----------------------------------------------------------------
>
>                 Key: SPARK-21768
>                 URL: https://issues.apache.org/jira/browse/SPARK-21768
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.2.0
>         Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
> PySpark
>            Reporter: Andrew Gross
>
> In a CSV with quoted fields, empty strings will be interpreted as NULL even 
> when a nullValue is explicitly set:
> Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX
> {{"XXNULLXX"|""|"XXNULLXX"|"foo"}}
> PySpark Script to load the file (from S3):
> {code:title=load.py|borderStyle=solid}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructField, StructType
> spark = SparkSession.builder.appName("test_csv").getOrCreate()
> fields = []
> fields.append(StructField("First Null Field", StringType(), True))
> fields.append(StructField("Empty String Field", StringType(), True))
> fields.append(StructField("Second Null Field", StringType(), True))
> fields.append(StructField("Non Empty String Field", StringType(), True))
> schema = StructType(fields)
> keys = ['s3://mybucket/test/demo.csv']
> bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss",
>                           mode="FAILFAST", sep="|", nullValue="XXNULLXX",
>                           schema=schema)
> bad_data.show()
> {code}
> Actual Output:
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|              null|             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
> Expected Output:
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|                  |             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
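
For reference, the behaviour the reporter expects can be sketched outside Spark with Python's standard csv module. This is only an illustration of the intended semantics, not Spark's CSV reader: the helper name parse_line is hypothetical, and the sample line, delimiter, and nullValue are taken from the report above. Only fields equal to the null sentinel become None, while a quoted empty field stays an empty string.

```python
import csv
import io

# nullValue from the report; only this exact string should map to NULL.
NULL_SENTINEL = "XXNULLXX"


def parse_line(line, sep="|", null_value=NULL_SENTINEL):
    """Parse one delimited line, mapping only the sentinel to None.

    Hypothetical helper for illustration; it does not use Spark.
    """
    reader = csv.reader(io.StringIO(line), delimiter=sep, quotechar='"')
    row = next(reader)
    # An empty string is kept as-is; only the sentinel becomes None.
    return [None if field == null_value else field for field in row]


row = parse_line('"XXNULLXX"|""|"XXNULLXX"|"foo"')
print(row)  # [None, '', None, 'foo']
```

The second field comes back as '' rather than None, which matches the "Expected Output" table in the report.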



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

