Ruslan Dautkhanov created SPARK-25251: -----------------------------------------
Summary: Make spark-csv's `quote` and `escape` options conform to RFC 4180
Key: SPARK-25251
URL: https://issues.apache.org/jira/browse/SPARK-25251
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.3.0, 2.3.1, 2.4.0, 3.0.0
Reporter: Ruslan Dautkhanov

As described in [RFC 4180|https://tools.ietf.org/html/rfc4180], page 2:

{noformat}
7.  If double-quotes are used to enclose fields, then a double-quote
    appearing inside a field must be escaped by preceding it with
    another double quote
{noformat}

That is what Excel does by default, for example. In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using the backslash (\). To get the RFC-conformant behavior, you have to explicitly tell Spark to use the double quote as the escape character:

{code}
.option('quote', '"')
.option('escape', '"')
{code}

Without these options, a comma inside a quoted column can be misinterpreted as a field delimiter. So this is a request to make the spark-csv reader RFC 4180 compatible with regard to the default values of the `quote` and `escape` options (make both equal to "). Since this is a backward-incompatible change, Spark 3.0 might be a good release for it.

Some more background: https://stackoverflow.com/a/45138591/470583

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
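To illustrate the doubled-quote rule from RFC 4180 without a Spark cluster, here is a minimal sketch using Python's standard csv module, whose default `doublequote=True` mode implements exactly the escaping that this issue asks Spark to default to (the sample record is invented for illustration):

```python
import csv
import io

# RFC 4180 input: the embedded double quote is escaped by doubling it,
# and the comma inside the quoted field is NOT a delimiter.
data = '1,"He said ""hi"", then left",done\r\n'

# doublequote=True (the csv module's default) applies the RFC 4180 rule:
# "" inside a quoted field decodes to a single " character.
rows = list(csv.reader(io.StringIO(data), quotechar='"', doublequote=True))
print(rows)  # [['1', 'He said "hi", then left', 'done']]
```

Spark's current defaults (escape = \) would instead leave the `""` pair and the inner comma open to misparsing, which is why the explicit `quote`/`escape` options above are needed today.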