[ https://issues.apache.org/jira/browse/SPARK-28733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-28733.
----------------------------------
    Resolution: Cannot Reproduce

Resolving this. It would be nice if somebody could identify the JIRA that fixed this issue and see whether we can backport it.

> DataFrameReader of Spark not able to recognize the very first quote character, while custom unicode quote character is used
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28733
>                 URL: https://issues.apache.org/jira/browse/SPARK-28733
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Mrinal Bhattacherjee
>            Priority: Major
>              Labels: DataFrameReader, SparkCSV, dataframe
>
> I recently encountered strange behaviour while reading a CSV file using the
> DataFrameReader of the org.apache.spark.sql package (Spark version 2.3.2).
> Here is my Spark read code snippet:
>
>     val sepChar = "\u00C7" // Ç
>     val quoteChar = "\u1E1C" // Ḝ
>     val escapeChar = "\u1E1D" // ḝ
>     val inputCsvFile = "<some-local-windows-path>\\input_ab.csv"
>
>     val readDF = sparkSession.read.option("sep", sepChar)
>       .option("encoding", encoding.toUpperCase)
>       .option("quote", quoteChar)
>       .option("escape", escapeChar)
>       .option("header", "false")
>       .option("multiLine", "true")
>       .csv(inputCsvFile)
>     readDF.cache()
>     readDF.show(20, false)
>
> Due to some awful data, I'm forced to use Unicode characters as the separator,
> quote, and escape characters instead of the default ones. Below is my input
> sample data.
>
>     Ḝ1ḜÇḜsmithḜÇḜ5Ḝ
>     Ḝ2ḜÇḜdousonḜÇḜ6Ḝ
>     Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ
>
> Here Ç is the field separator, Ḝ is the quote character, and all field values
> are wrapped in this custom quote character.
>
> The problem is that the first occurrence of the quote character is somehow not
> recognized by Spark. I tried characters other than the Unicode one, such as
> ` ~ X (the letter X, just as a test scenario), and the default quote (") as
> well. It works fine in all scenarios except when a Unicode character is used
> as the quote character. The first occurrence of the Unicode quote character
> comes through as some non-printable character ��, so the closing quote
> character of the first field of the first record is included in the data.
> Here is the output of df.show:
> {code}
> +---+------------+-----+
> |id |name        |class|
> +---+------------+-----+
> |��1Ḝ |smith       |5    |
> |2  |douson      |6    |
> |3  |sr,tendulkar|10   |
> +---+------------+-----+
> {code}
> It happens only for the first field of the very first record. The other quote
> characters in the file are read as expected, without any issues. When I add
> an extra empty record at the top of the file, i.e. simply a newline (\n) as
> the very first line, the issue doesn't occur. That empty row is not even
> treated as an empty record in the DataFrame, so my problem is solved. But
> this manipulation cannot be done in production, so the issue still needs to
> be addressed.
>
> I believe this is a bug. If it is not, kindly let me know how to process such
> a file without hitting this issue; otherwise, kindly provide a fix at the
> earliest. Thanks in advance.
>
> Best Regards,
> Mrinal
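
For anyone trying to reproduce this on 2.3.2, the workaround mentioned in the report (an extra empty line at the very top of the file) could be applied to a copy of the input with a few lines of plain Scala. This is only a sketch: `PrependEmptyLine` and `prependEmptyLine` are hypothetical names, not part of Spark, and UTF-8 is assumed for the sample file because the report does not say which encoding was used.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object PrependEmptyLine {
  // Hypothetical helper (not a Spark API): writes a copy of `src` to `dst`
  // with a single "\n" in front, mirroring the manual workaround in the report.
  def prependEmptyLine(src: Path, dst: Path): Path = {
    val original = Files.readAllBytes(src)
    val newline  = "\n".getBytes(StandardCharsets.UTF_8)
    Files.write(dst, newline ++ original)
  }

  def main(args: Array[String]): Unit = {
    // Sample rows from the report; UTF-8 is an assumption.
    val sample =
      "Ḝ1ḜÇḜsmithḜÇḜ5Ḝ\nḜ2ḜÇḜdousonḜÇḜ6Ḝ\nḜ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ\n"
    val src = Files.createTempFile("input_ab", ".csv")
    val dst = Files.createTempFile("input_ab_padded", ".csv")
    Files.write(src, sample.getBytes(StandardCharsets.UTF_8))

    prependEmptyLine(src, dst)
    println(s"padded copy written to $dst")
  }
}
```

The padded copy can then be passed to `.csv(...)` with the same options as in the report. Whether it actually avoids the problem on a given Spark version is untested here; the report only states that the manual equivalent did.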