[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217875#comment-17217875 ]

Walt Elder commented on SPARK-22236:
------------------------------------

[~snemmal], try:

{code}
.option("quote", "\"")
.option("escape", "\"")
{code}

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with
> a backslash by default. However, the appropriate behaviour as set out by RFC
> 4180 (and adhered to by many software packages) is to escape using a second
> double quote.
> This piece of Python code demonstrates the issue:
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1, 2])
>
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
>
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
>
> # explicitly stating the escape character fixes the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading back the file written by Spark may
> result in garbage:
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv')  # reading the file correctly
> df.write.format("csv").save('testout.csv')
>
> with open('testout.csv/part-csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
> where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would
> be useful if Spark had sensible defaults that conform to the above-mentioned
> RFC (as well as W3C recommendations). I realise this would be a breaking
> change and thus, if accepted, it would probably need to result in a warning
> first, before moving to a new default.
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975776#comment-16975776 ]

Santhosh commented on SPARK-22236:
----------------------------------

The code mentioned above,

{code}
spark.read.option('escape', '"').csv('testfile.csv').collect()
{code}

is erroring out with "RuntimeException: escape cannot be more than one character". I am not sure why " (a double quote) is considered more than one character. Please help/suggest/correct. (My Spark version is 2.4.3.)
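That exception fires when the option value Spark receives is longer than one character, which typically means the quote arrived pre-escaped (for example as the two-character string {{\"}}) from a shell, properties file, or string-literal layer; this is an inference from the error message, not something confirmed in the thread. A minimal sketch of the difference, assuming a PySpark session named {{spark}}:

{code}
# The Python literal '"' is a single character, so this is accepted:
df = spark.read.option("escape", '"').csv("testfile.csv")

# The Python literal '\\"' is two characters (backslash + quote) and
# would raise "escape cannot be more than one character":
assert len('"') == 1
assert len('\\"') == 2
{code}

In Scala, the equivalent single-character value is written {{"\""}}, as the comment above spells out.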
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582933#comment-16582933 ]

Steve Loughran commented on SPARK-22236:
----------------------------------------

bq. But is the implication that we can never change the default behavior to match the standard, even in a major revision, because of the impact on reading previously written data? I don't see how that would be sensible.

I'm not sure about the Spark compatibility guidelines, but given how vague a "standard" CSV is, it's invariably a trouble spot. You could always argue for a new format name, "csv-standard", which parses CSV with RFC 4180 escapes, multiline records, and the rest of the differing semantics.
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582930#comment-16582930 ]

Steve Loughran commented on SPARK-22236:
----------------------------------------

bq. One can always repartition after reading, and if the input is large enough to benefit substantially from parsing in parallel, it can be (likely will be?) split into multiple files before reading.

That assumes you are in control of the incoming CSVs, which you are not when that first read is an ETL workflow over CSV files of a few GB. After that first read, I'd go not just for a repartition but for saving in a better format.

bq. Does the benefit of reading single files in parallel by default outweigh the cost of silently discarding standard-compliant data by default?

I'm not happy about the silent discarding, but it's the change in behaviour that I'm worried about.
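As a sketch of the "read once, then leave CSV behind" pattern described above, assuming a PySpark session named {{spark}} and hypothetical paths: the strict single-threaded multiline parse happens once, then parallelism is restored and the data lands in a splittable columnar format.

{code}
# One-time ETL pass over an incoming CSV of a few GB.
df = (spark.read
      .option("escape", '"')        # RFC 4180: quotes escaped by doubling
      .option("multiLine", "true")  # allow newlines inside quoted fields
      .option("header", "true")
      .csv("/data/raw/input.csv"))  # hypothetical input path

(df.repartition(64)                 # restore the parallelism lost to multiLine
   .write.mode("overwrite")
   .parquet("/data/clean/input"))   # hypothetical output path
{code}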
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580229#comment-16580229 ]

Joe Pallas commented on SPARK-22236:
------------------------------------

{quote}people with preexisting datasets exported by Spark would suffer (unexpected) data loss{quote}

Yes, that is an important consideration. But is the implication that we can never change the default behavior to match the standard, *even in a major revision*, because of the impact on reading previously written data? I don't see how that would be sensible. So, how can we minimize the risk associated with making this change?
* Documentation is certainly one way, in both the API doc and the release notes.
* We could add a {{spark2-compat}} option (equivalent to setting escape, but the name emphasizes the compatibility semantics).
* We could flag a warning or error if we see {{\"}} in the input. Since that could appear in legal RFC 4180 input, we might then want an option to suppress the warning/error when we know the input was not generated with the old settings.

I've no idea how difficult the implementation would be on that last one.
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579974#comment-16579974 ]

Ondrej Kokes commented on SPARK-22236:
--------------------------------------

Making multiLine=true the default would cause some slowdown, but data quality would either increase or stay the same; it would never go down. So the discussion there is mostly about performance. With escape changes, while we would see improvements in data quality on the input side, *people with preexisting datasets exported by Spark would suffer (unexpected) data loss,* because the escaping strategy would potentially differ from the one in effect when the data was written. I think that is the more important aspect to consider.
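A minimal sketch of the failure mode being described, assuming a PySpark session named {{spark}}: data written under the current backslash-escaping default and later read back with the RFC 4180 escape comes back changed, with no error raised.

{code}
from pyspark.sql import Row

df = spark.createDataFrame([Row(c='a "quoted" string')])

# Written with today's default: embedded quotes are escaped as \" in the file.
df.write.mode("overwrite").csv("/tmp/old-default")

# Read back assuming RFC 4180 doubled-quote escaping: the backslashes are no
# longer interpreted as escapes, so the recovered value silently differs
# from the original string.
back = spark.read.option("escape", '"').csv("/tmp/old-default")
back.show(truncate=False)
{code}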
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577063#comment-16577063 ]

Joe Pallas commented on SPARK-22236:
------------------------------------

One can always repartition after reading, and if the input is large enough to benefit substantially from parsing in parallel, it can be (likely will be?) split into multiple files before reading. Does the benefit of reading single files in parallel by default outweigh the cost of silently discarding standard-compliant data by default?
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575629#comment-16575629 ]

Steve Loughran commented on SPARK-22236:
----------------------------------------

I wouldn't recommend changing to multiLine=true by default, precisely because the change is so traumatic: you'd never be able to partition a CSV file across more than one worker.
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575305#comment-16575305 ]

Joe Pallas commented on SPARK-22236:
------------------------------------

If this can't change before 3.0, how about a note in the documentation explaining that compatibility with RFC 4180 requires setting {{escape}} to {{"}} and {{multiLine}} to {{true}}?
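For reference, a minimal sketch of the configuration such a note would describe, assuming a PySpark session named {{spark}}:

{code}
# RFC 4180-compatible reading: doubled quotes and multiline fields.
df = (spark.read
      .option("escape", '"')
      .option("multiLine", "true")
      .csv("testfile.csv"))

# RFC 4180-compatible writing: escape embedded quotes by doubling them.
df.write.option("escape", '"').csv("testout")
{code}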
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481655#comment-16481655 ]

Ondrej Kokes commented on SPARK-22236:
--------------------------------------

There is one more setting that makes the default CSV parser in Spark violate RFC 4180: the multiLine setting. It defaults to false, so newlines are treated as row separators, but CSV allows newlines within fields, as long as the field is enclosed in double quotes. Sadly, I think setting multiLine to true by default is less feasible than changing the escape setting, because multiLine=false makes the parser easily parallelisable while still parsing the majority of CSV data correctly. But, combined with mode=PERMISSIVE, this setting makes the default parser a landmine.
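A minimal illustration of that landmine, assuming a PySpark session named {{spark}}: a single RFC 4180 record with an embedded newline is split at the newline unless multiLine is enabled.

{code}
# One logical record whose first field contains a newline.
with open("/tmp/multiline.csv", "w") as f:
    f.write('"line one\nline two",second column\n')

# Default (multiLine=false): the embedded newline is treated as a record
# separator, so the file parses as two mangled rows instead of one.
spark.read.csv("/tmp/multiline.csv").show()

# multiLine=true: one row, with the first column spanning two lines.
spark.read.option("multiLine", "true").csv("/tmp/multiline.csv").show()
{code}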
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199869#comment-16199869 ]

Ondrej Kokes commented on SPARK-22236:
--------------------------------------

Thanks for the comments! I originally forgot to note [where the escape character was set|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91], so I've edited the ticket to include it. There are a few more places where the escape character is changed explicitly, but those are in tests:

{code}
$ rg -t scala '"escape",'
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
91:    val escape = getChar("escape", '\\')

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
419:        .option("escape", "\"")
445:        .option("escape", "\"")
471:        .option("escape", "\"")
{code}
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199825#comment-16199825 ]

Hyukjin Kwon commented on SPARK-22236:
--------------------------------------

Yup, what is said ^ looks all correct. I think we should set all the defaults as they are in Univocity in the future, if possible. They happened to be set differently in the first place. On a quick look, the root cause is {{escape}} being {{\}} in Spark, where the Univocity parser has {{"}} by default.
[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180
[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199245#comment-16199245 ]

Sean Owen commented on SPARK-22236:
-----------------------------------

Interesting, because the Univocity parser internally seems to default to RFC 4180 settings, but the Spark implementation overrides this with a default of {{\}}. [~hyukjin.kwon], was that for backwards compatibility with previous implementations? In any event, I'm not sure we'd change the default behavior on this side of Spark 3.x, but you can easily configure the writer to use a double quote for escape.
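A minimal sketch of that writer configuration, assuming a PySpark DataFrame {{df}} and a hypothetical output path:

{code}
# Escape embedded double quotes by doubling them, per RFC 4180.
(df.write
   .option("escape", '"')
   .csv("/tmp/rfc4180-out"))
{code}

With that option set, a value like {{a "quoted" string}} is written as {{"a ""quoted"" string"}} instead of {{"a \"quoted\" string"}}.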