[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23649:
-------------------------------
Shepherd: Herman van Hovell

> CSV schema inferring fails on some UTF-8 chars
> ----------------------------------------------
>
>                 Key: SPARK-23649
>                 URL: https://issues.apache.org/jira/browse/SPARK-23649
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character that
> starts with the byte *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
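
The index 63 in the exception is what a lead-byte lookup would produce for 0xFF if the table were indexed by (byte & 0xFF) - 192, since 255 - 192 = 63. The sketch below is an illustrative model of that arithmetic, not Spark's source: the class name, the 62-entry table, and its contents are assumptions chosen only to be consistent with the reported index.

```java
// Illustrative model only. The table size (62) and the "- 192" offset are
// assumptions that match the reported index 63; they are not copied from Spark.
final class LeadByteLookupSketch {
    // One entry per lead byte 0xC0..0xFD: 32 two-byte, 16 three-byte,
    // 8 four-byte, 4 five-byte and 2 six-byte lead bytes (62 entries).
    static final int[] BYTES_FOR_LEAD = buildTable();

    static int[] buildTable() {
        int[] table = new int[62];
        int[] runLengths = {32, 16, 8, 4, 2};
        int i = 0;
        for (int r = 0; r < runLengths.length; r++) {
            for (int n = 0; n < runLengths[r]; n++) {
                table[i++] = r + 2; // sequence length for this lead byte
            }
        }
        return table;
    }

    // Offset into the table for a given first byte (negative below 0xC0).
    static int lookupIndex(byte b) {
        return (b & 0xFF) - 192;
    }

    // Bytes below 0xC0 are treated as one-byte characters; no upper bound check.
    static int numBytesForFirstByte(byte b) {
        int offset = lookupIndex(b);
        return offset >= 0 ? BYTES_FOR_LEAD[offset] : 1;
    }

    public static void main(String[] args) {
        System.out.println(numBytesForFirstByte((byte) 'A'));  // prints 1
        System.out.println(numBytesForFirstByte((byte) 0xC3)); // prints 2
        try {
            numBytesForFirstByte((byte) 0xFF);
        } catch (ArrayIndexOutOfBoundsException e) {
            // 0xFF maps to index 63, past the end of the 62-entry table.
            System.out.println("out of bounds at index " + lookupIndex((byte) 0xFF));
        }
    }
}
```

Under these assumptions, 0xFE (index 62) would fail the same way, and a bounds check on the offset would be enough to avoid the exception.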
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
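
The byte 0xFF never occurs in well-formed UTF-8, so the field "ABGUN" followed by ff in the dump is not a valid UTF-8 sequence. A minimal sketch of how to confirm this with the JDK's strict decoder follows; the class and method names are hypothetical, and the byte array is copied by hand from the dump rather than read from the attachment.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte array is well-formed UTF-8.
final class StrictUtf8Check {
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "ABGUN" followed by 0xFF, as in the second data row of the dump.
        byte[] field = {0x41, 0x42, 0x47, 0x55, 0x4e, (byte) 0xff};
        System.out.println(isValidUtf8(field)); // prints false

        // The lenient String constructor substitutes U+FFFD for the bad byte.
        String lenient = new String(field, StandardCharsets.UTF_8);
        System.out.println(lenient.equals("ABGUN\uFFFD")); // prints true
    }
}
```

The lenient decoding is consistent with the replacement character shown after ABGUN in the outputs quoted in this report.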
> Schema inference does not fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+-----+
> |channel|code
> +-------+-----+
> | United| 123
> | ABGUN�| 456
> +-------+-----+
> {code}
> Spark can also read the CSV file if the schema is specified explicitly:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23649:
---
Attachment: utf8xFF.csv
