[ https://issues.apache.org/jira/browse/SPARK-19488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-19488.
---------------------------------
    Resolution: Fixed
 Fix Version/s: 2.2.0

Issue resolved by pull request 16834
[https://github.com/apache/spark/pull/16834]

> CSV infer schema does not take into account Inf,-Inf,NaN
> --------------------------------------------------------
>
>                 Key: SPARK-19488
>                 URL: https://issues.apache.org/jira/browse/SPARK-19488
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>       Environment: Windows 10, SparkShell
>          Reporter: Shivam Dalmia
>          Assignee: Song Jun
>            Labels: easyfix, features
>           Fix For: 2.2.0
>
> I observed that while loading a CSV as a dataframe, the user-specified values
> for nanValue, positiveInf and negativeInf are disregarded when inferSchema =
> true. (They work if a user-specified schema is provided.) However, even the
> Spark defaults for the infinities (Inf and -Inf) do not work with
> inferSchema.
> Taking a look at the source code for CSV schema inference
> (CSVInferSchema.scala), I found the following snippet:
> {code}
> 1.  private def tryParseDouble(field: String, options: CSVOptions): DataType = {
> 2.    if ((allCatch opt field.toDouble).isDefined) {
> 3.      DoubleType
> 4.    } else {
> 5.      tryParseTimestamp(field, options)
> 6.    }
> 7.  }
> 8.
> 9.  private def tryParseTimestamp(field: String, options: CSVOptions): DataType = {
> 10.   // This case infers a custom `dateFormat` is set.
> 11.   if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
> 12.     TimestampType
> 13.   } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
> 14.     // We keep this for backwards compatibility.
> 15.     TimestampType
> 16.   } else {
> 17.     tryParseBoolean(field, options)
> 18.   }
> 19. }
> {code}
> Notably, the user-specified CSV options are not used at all when determining
> whether the field is of type double (as we can see in line 2).
> We can see that the options are used for the timestamp type (line 11), which
> is why the 'dateFormat' option does work.
> However, when the field is NaN, inference works anyway, because Scala's
> toDouble does convert the string "NaN" to the double equivalent of NaN. (I
> tried it in the shell):
> {code}
> scala> var field = "8.0942";
> field: String = 8.0942
> scala> allCatch.opt(field.toDouble)
> res12: Option[Double] = Some(8.0942)
> scala> field = "NaN";
> field: String = NaN
> scala> allCatch.opt(field.toDouble)
> res13: Option[Double] = Some(NaN)
> scala> field = "Inf";
> field: String = Inf
> scala> allCatch.opt(field.toDouble)
> res14: Option[Double] = None
> {code}
> Interestingly, Scala does have double equivalents of Infinity and -Infinity
> (but the Spark defaults are Inf and -Inf, which is why they don't work):
> {code}
> scala> field = "Infinity";
> field: String = Infinity
> scala> allCatch.opt(field.toDouble)
> res15: Option[Double] = Some(Infinity)
> scala> field = "-Infinity";
> field: String = -Infinity
> scala> allCatch.opt(field.toDouble)
> res16: Option[Double] = Some(-Infinity)
> {code}
> The following CSV, when ingested with inferSchema = true, therefore has its
> value column inferred as Double, regardless of the user-specified options:
> {code}
> ID,name,value,irrational,prime,real
> 1,e,2.7,true,false,true
> 2,pi,3.14,true,false,true
> 3,inf,Infinity,false,false,true
> 4,-inf,-Infinity,false,false,true
> 5,i,NaN,false,false,false
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
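The gap described above suggests the shape of a remedy: consult the configured NaN/Inf tokens before falling back to toDouble. The sketch below is hypothetical (the object and method names are invented for illustration, and this is not the actual patch merged in PR 16834); it only shows the recognition logic in isolation:

```scala
import scala.util.Try

// Hedged sketch: a type-inference check for doubles that honours the
// user-configured tokens, unlike the tryParseDouble quoted above, which
// relies solely on String.toDouble. Note that toDouble accepts "NaN",
// "Infinity" and "-Infinity" but rejects Spark's "Inf"/"-Inf" defaults.
object DoubleInferenceSketch {
  // Defaults mirror Spark's documented CSV option defaults for
  // nanValue, positiveInf and negativeInf.
  def looksLikeDouble(field: String,
                      nanValue: String = "NaN",
                      positiveInf: String = "Inf",
                      negativeInf: String = "-Inf"): Boolean =
    field == nanValue || field == positiveInf || field == negativeInf ||
      Try(field.toDouble).isSuccess
}
```

With a check like this, a cell holding "Inf" would be recognised as a double under the default options, and custom tokens (e.g. a user-supplied nanValue) would be honoured during inference the same way they already are when an explicit schema is given.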