[ https://issues.apache.org/jira/browse/SPARK-31479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091193#comment-17091193 ]
Hyukjin Kwon commented on SPARK-31479:
--------------------------------------

Use the {{locale}} option. See SPARK-25945

> Numbers with thousands separator or locale-specific decimal separator not parsed correctly
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31479
>                 URL: https://issues.apache.org/jira/browse/SPARK-31479
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: Ranjit Iyer
>            Priority: Major
>
> CSV files that contain numbers with a thousands separator (or a locale-specific decimal separator) are not parsed correctly and are reported as {{null}}.
> [https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html]
> A user in France might expect "10,100" to be parsed as a float, while a user in the US might want Spark to interpret it as an integer value (10100).
> UnivocityParser is not locale-aware and must use a NumberFormat to parse string values to numbers.
>
> *US Locale*
> {{scala> Source.fromFile("/Users/ranjit.iyer/work/data/us.csv").getLines.mkString("\n")}}
> {{res28: String =}}
> {{"Value"}}
> {{"10,000"}}
> {{"20,000"}}
> {{scala> Locale.setDefault(Locale.US)}}
> {{scala> val _schema = StructType(StructField("Value", IntegerType, true) :: Nil)}}
> {{scala> val df = spark.read.format("csv").option("header", "true").schema(_schema).load("/Users/ranjit.iyer/work/data/us.csv")}}
> {{df: org.apache.spark.sql.DataFrame = [Value: int]}}
> {{scala> df.show}}
> {{+-----+}}
> {{|Value|}}
> {{+-----+}}
> {{| null|}}
> {{| null|}}
> {{+-----+}}
>
> *French Locale*
> {{scala> Source.fromFile("/Users/ranjit.iyer/work/data/fr.csv").getLines.mkString("\n")}}
> {{res43: String =}}
> {{"Value"}}
> {{"10,123"}}
> {{"20,456"}}
> {{scala> Locale.setDefault(Locale.FRANCE)}}
> {{scala> val _schema = StructType(StructField("Value", FloatType, true) :: Nil)}}
> {{scala> val df = spark.read.format("csv").option("header", "true").schema(_schema).load("/Users/ranjit.iyer/work/data/fr.csv")}}
> {{df: org.apache.spark.sql.DataFrame = [Value: float]}}
> {{scala> df.show}}
> {{+-----+}}
> {{|Value|}}
> {{+-----+}}
> {{| null|}}
> {{| null|}}
> {{+-----+}}
>
> The fix is to use a NumberFormat; I have it working locally and will raise a PR for review.
> {{NumberFormat.getInstance.parse(_).intValue()}}
> Thousands separators are quite commonly found on the internet. My workflow has been to copy to Excel, export to CSV, and analyze in Spark.
> [https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population]