Ruslan Dautkhanov created SPARK-25251:
-----------------------------------------

             Summary: Make spark-csv's `quote` and `escape` options conform to 
RFC 4180
                 Key: SPARK-25251
                 URL: https://issues.apache.org/jira/browse/SPARK-25251
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.3.1, 2.3.0, 2.4.0, 3.0.0
            Reporter: Ruslan Dautkhanov


As described in [RFC-4180|https://tools.ietf.org/html/rfc4180], page 2 -

{noformat}
   7. If double-quotes are used to enclose fields, then a double-quote 
appearing inside a field must be escaped by preceding it with another double 
quote
{noformat}

That's what Excel does, for example, by default.
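As an illustration (using Python's standard {{csv}} module, not Spark, purely because it follows RFC 4180 doubling by default), a doubled double-quote inside a quoted field parses back to a single literal quote, and the enclosed comma stays part of the field:

{code}
import csv
import io

# RFC 4180 input: the inner double quote is escaped by doubling it,
# and the comma is protected by the enclosing quotes.
data = 'id,comment\r\n1,"He said ""hi"", then left"\r\n'

rows = list(csv.reader(io.StringIO(data)))  # doublequote=True is the default
print(rows[1])  # ['1', 'He said "hi", then left']
{code}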

However, in Spark (as of Spark 2.1), escaping is done by default in a non-RFC 
way, using the backslash (\). To fix this you have to explicitly tell Spark to 
use the double quote as the escape character:

{code}
.option('quote', '"') 
.option('escape', '"')
{code}

This may explain cases where a comma was not recognized as being inside a 
quoted column and the field was split incorrectly.
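The two conventions can be contrasted with Python's {{csv}} module (again only as an analogy for what the Spark reader does; the parser settings below mimic the two sets of `quote`/`escape` defaults). The same logical row needs different reader settings depending on how the inner quote was escaped:

{code}
import csv
import io

# The same logical field, written under the two escaping conventions:
rfc_4180 = '1,"He said ""hi"", then left"\r\n'        # doubled quote (RFC 4180)
backslash = '1,"He said \\"hi\\", then left"\r\n'      # backslash-escaped (Spark default)

# RFC 4180 data parses with the default dialect (doublequote=True).
rfc_row = next(csv.reader(io.StringIO(rfc_4180)))

# Backslash-escaped data needs doublequote off and an explicit escapechar.
bs_row = next(csv.reader(io.StringIO(backslash), doublequote=False, escapechar='\\'))

print(rfc_row)  # ['1', 'He said "hi", then left']
print(bs_row)   # same field values
{code}

A reader configured for one convention will mis-parse data written under the other, which is exactly the interoperability problem with Excel-produced files.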

So this is a request to make the spark-csv reader RFC-4180 compliant with 
regard to the default values of the `quote` and `escape` options (make both 
equal to " ).

Since this is a backward-incompatible change, Spark 3.0 might be a good release 
for this change.

Some more background: https://stackoverflow.com/a/45138591/470583



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
