[ 
https://issues.apache.org/jira/browse/SPARK-31479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31479.
----------------------------------
    Resolution: Duplicate

> Numbers with thousands separator or locale specific decimal separator not 
> parsed correctly
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31479
>                 URL: https://issues.apache.org/jira/browse/SPARK-31479
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: Ranjit Iyer
>            Priority: Major
>
> CSV files that contain numbers with thousands separators (or locale-specific 
> decimal separators) are not parsed correctly and are reported as {{null}}.
> [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html]
> A user in France might expect "10,100" to be parsed as the float 10.1, while a user 
> in the US might want Spark to interpret it as the integer 10100. 
> UnivocityParser is not locale aware; it should use {{java.text.NumberFormat}} to 
> parse string values into numbers. 
> *US Locale*
> {code:scala}
> scala> Source.fromFile("/Users/ranjit.iyer/work/data/us.csv").getLines.mkString("\n")
> res28: String =
> "Value"
> "10,000"
> "20,000"
>
> scala> Locale.setDefault(Locale.US)
>
> scala> val _schema = StructType(StructField("Value", IntegerType, true) :: Nil)
>
> scala> val df = spark.read.format("csv").option("header", "true").schema(_schema).load("/Users/ranjit.iyer/work/data/us.csv")
> df: org.apache.spark.sql.DataFrame = [Value: int]
>
> scala> df.show
> +-----+
> |Value|
> +-----+
> | null|
> | null|
> +-----+
> {code}
> *French Locale*
> {code:scala}
> scala> Source.fromFile("/Users/ranjit.iyer/work/data/fr.csv").getLines.mkString("\n")
> res43: String =
> "Value"
> "10,123"
> "20,456"
>
> scala> Locale.setDefault(Locale.FRANCE)
>
> scala> val _schema = StructType(StructField("Value", FloatType, true) :: Nil)
>
> scala> val df = spark.read.format("csv").option("header", "true").schema(_schema).load("/Users/ranjit.iyer/work/data/fr.csv")
> df: org.apache.spark.sql.DataFrame = [Value: float]
>
> scala> df.show
> +-----+
> |Value|
> +-----+
> | null|
> | null|
> +-----+
> {code}
> The fix is to use a {{NumberFormat}}; I have it working locally and will 
> raise a PR for review.
> {{NumberFormat.getInstance.parse(_).intValue()}} 
> Thousands separators are quite commonly found on the internet. My workflow has 
> been to copy data into Excel, export it to CSV, and analyze it in Spark.
> [https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population]
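The locale dependence described above can be reproduced with plain {{java.text.NumberFormat}}, outside Spark. A minimal sketch (the class name is illustrative; the Spark/UnivocityParser wiring proposed in the issue is not shown):

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleParse {
    // Parse a string using the given locale's grouping/decimal conventions.
    static Number parse(String s, Locale locale) throws ParseException {
        NumberFormat nf = NumberFormat.getInstance(locale);
        return nf.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        // US: comma is a thousands (grouping) separator.
        System.out.println(parse("10,100", Locale.US).intValue());
        // France: comma is the decimal separator.
        System.out.println(parse("10,100", Locale.FRANCE).floatValue());
    }
}
```

The same literal "10,100" yields 10100 under {{Locale.US}} and 10.1 under {{Locale.FRANCE}}, which is exactly why a locale-unaware parser returns {{null}} for one of the two users.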



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
