[ https://issues.apache.org/jira/browse/SPARK-28733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908512#comment-16908512 ]

Christian Hollinger commented on SPARK-28733:
---------------------------------------------

I can confirm this issue on Pop!_OS with OpenJDK 8 and Spark 2.3.3.

It appears to be fixed on the master branch:
{code:java}
val sepChar = "\u00C7"    // Ç
val quoteChar = "\u1E1C"  // Ḝ
val escapeChar = "\u1E1D" // ḝ
val inputCsvFile = "/home/christian/workspace/tests/testfile.csv"
val encoding = "utf-8"

val readDF = spark.read
  .option("sep", sepChar)
  .option("encoding", encoding.toUpperCase)
  .option("quote", quoteChar)
  .option("escape", escapeChar)
  .option("header", "false")
  .option("multiLine", "true")
  .csv(inputCsvFile)
readDF.cache()
readDF.show(20, false)

+---+------------+---+
|_c0|_c1         |_c2|
+---+------------+---+
|1  |smith       |5  |
|2  |douson      |6  |
|3  |sr,tendulkar|10 |
+---+------------+---+
{code}
It appears the call into com.univocity.parsers.csv from
org.apache.spark.sql.execution.datasources.csv.CSVDataSource has since been
updated and now passes the encoding through correctly.
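To isolate the parser from Spark, here is a minimal sketch of the same idea
(this is an illustration, not Spark's actual internals; it assumes the
univocity-parsers library on the classpath and the test file from the snippet
above): decode the byte stream with an explicit charset before the tokenizer
looks for the custom quote character.
{code:java}
import java.io.{FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import scala.collection.JavaConverters._

// Same custom characters as in the Spark repro above.
val settings = new CsvParserSettings
settings.getFormat.setDelimiter('\u00C7')   // Ç
settings.getFormat.setQuote('\u1E1C')       // Ḝ
settings.getFormat.setQuoteEscape('\u1E1D') // ḝ

// Decode as UTF-8 *before* tokenizing. If the stream were decoded with the
// wrong charset, the very first multi-byte quote character would be mangled,
// which matches the symptom described in the report.
val reader = new InputStreamReader(
  new FileInputStream("/home/christian/workspace/tests/testfile.csv"),
  StandardCharsets.UTF_8)

new CsvParser(settings).parseAll(reader).asScala
  .foreach(row => println(row.mkString("|")))
{code}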

 

> DataFrameReader of Spark not able to recognize the very first quote 
> character, while custom unicode quote character is used
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28733
>                 URL: https://issues.apache.org/jira/browse/SPARK-28733
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Mrinal Bhattacherjee
>            Priority: Critical
>              Labels: DataFrameReader, SparkCSV, dataframe
>
> I have recently encountered strange behaviour while reading a CSV file 
> using the DataFrameReader of the org.apache.spark.sql package (Spark 
> version 2.3.2). Here is my Spark read snippet:
> {code:java}
> val sepChar = "\u00C7"    // Ç
> val quoteChar = "\u1E1C"  // Ḝ
> val escapeChar = "\u1E1D" // ḝ
> val inputCsvFile = "<some-local-windows-path>\\input_ab.csv"
> val encoding = "utf-8" // not defined in the original snippet; assumed UTF-8
>
> val readDF = sparkSession.read
>   .option("sep", sepChar)
>   .option("encoding", encoding.toUpperCase)
>   .option("quote", quoteChar)
>   .option("escape", escapeChar)
>   .option("header", "false")
>   .option("multiLine", "true")
>   .csv(inputCsvFile)
> readDF.cache()
> readDF.show(20, false)
> {code}
> Because of some awful data, I am forced to use Unicode characters as the 
> separator, quote, and escape characters instead of the defaults. Below is 
> my input sample data:
> {code}
> Ḝ1ḜÇḜsmithḜÇḜ5Ḝ
> Ḝ2ḜÇḜdousonḜÇḜ6Ḝ
> Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ
> {code}
> Here Ç is the field separator, Ḝ is the quote character, and every field 
> value is wrapped in this custom quote character.
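> To make this reproducible, here is a minimal sketch (the relative file name 
> is illustrative) that writes exactly these three rows as UTF-8:
> {code:java}
> import java.nio.charset.StandardCharsets
> import java.nio.file.{Files, Paths}
>
> // The three sample rows, built from the unicode escapes defined above.
> val rows = Seq(
>   "\u1E1C1\u1E1C\u00C7\u1E1Csmith\u1E1C\u00C7\u1E1C5\u1E1C",
>   "\u1E1C2\u1E1C\u00C7\u1E1Cdouson\u1E1C\u00C7\u1E1C6\u1E1C",
>   "\u1E1C3\u1E1C\u00C7\u1E1Csr,tendulkar\u1E1C\u00C7\u1E1C10\u1E1C"
> )
> // File name is hypothetical; the report uses a local Windows path.
> Files.write(Paths.get("input_ab.csv"),
>   rows.mkString("\n").getBytes(StandardCharsets.UTF_8))
> {code}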
> The problem I am getting is that the first occurrence of the quote 
> character is somehow not recognized by Spark. I tried characters other 
> than Unicode, such as ` ~ X (the letter X just as a test scenario), and 
> even the default quote ("); it works fine in all those scenarios and fails 
> only when a Unicode character is used as the quote. The first occurrence 
> of the Unicode quote character comes through as a non-printable character 
> ��, so the closing quote of the first field of the first record ends up 
> included in the data.
> Here is the output of df.show:
> {code}
> +----+------------+-----+
> |id  |name        |class|
> +----+------------+-----+
> |��1Ḝ|smith       |5    |
> |2   |douson      |6    |
> |3   |sr,tendulkar|10   |
> +----+------------+-----+
> {code}
> This happens only for the first field of the very first record; all other 
> quote characters in the file are read as expected. When I keep an extra 
> empty record at the top of the file, i.e. simply a newline (\n) as the 
> very first line, the issue does not occur, and that empty row is not even 
> counted as a record in the DataFrame, so my problem is solved. But this 
> manipulation cannot be done in production, so it remains an issue that 
> needs attention.
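> For completeness, a minimal sketch of that workaround (the staged path is 
> hypothetical): copy the file with a single newline prepended before handing 
> it to Spark.
> {code:java}
> import java.nio.charset.StandardCharsets
> import java.nio.file.{Files, Paths}
>
> // Workaround illustration only: prepend one '\n' so the custom Unicode
> // quote is no longer the very first character of the stream.
> val src    = Paths.get("<some-local-windows-path>\\input_ab.csv")
> val staged = Paths.get("<some-local-windows-path>\\input_ab_staged.csv")
> Files.write(staged,
>   "\n".getBytes(StandardCharsets.UTF_8) ++ Files.readAllBytes(src))
> {code}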
> I feel this is a bug. If it is not, kindly let me know how to process such 
> a file without hitting this issue; otherwise, kindly provide a fix at the 
> earliest. Thanks in advance.
> Best Regards,
> Mrinal


