[ https://issues.apache.org/jira/browse/SPARK-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-16460.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

Issue resolved by pull request 14118
[https://github.com/apache/spark/pull/14118]

> Spark 2.0 CSV ignores NULL value in Date format
> -----------------------------------------------
>
>                 Key: SPARK-16460
>                 URL: https://issues.apache.org/jira/browse/SPARK-16460
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: SparkR
>            Reporter: Marcel Boldt
>            Priority: Minor
>             Fix For: 2.0.1, 2.1.0
>
> Trying to read a CSV file into Spark (using SparkR) containing just this data row:
> {code}
> 1|1998-01-01||
> {code}
> Spark 1.6.2 (Hadoop 2.6) gives me:
> {code}
> > head(sdf)
>   id          d dtwo
> 1  1 1998-01-01   NA
> {code}
> The Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with this error:
> {panel}
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
>   at java.text.DateFormat.parse(DateFormat.java:357)
>   at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
>   at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
>   at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
>   at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
>   at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Itera...
> {panel}
> The problem does indeed seem to be the NULL value, since the read succeeds when the third CSV column contains a valid date.
> R code:
> {code}
> #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
> Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <- sparkR.init(
>   master = "local",
>   sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
> )
> sqlContext <- sparkRSQL.init(sc)
>
> st <- structType(
>   structField("id", "integer"),
>   structField("d", "date"),
>   structField("dtwo", "date")
> )
>
> sdf <- read.df(
>   sqlContext,
>   path = "d:/date_test.csv",
>   source = "com.databricks.spark.csv",
>   schema = st,
>   inferSchema = "false",
>   delimiter = "|",
>   dateFormat = "yyyy-MM-dd",
>   nullValue = "",
>   mode = "PERMISSIVE"
> )
>
> head(sdf)
>
> sparkR.stop()
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
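Per the stack trace, the failure originates in {{CSVTypeCast$.castTo}}, which hands the empty string straight to {{DateFormat.parse}}. Below is a minimal plain-Java sketch of the kind of guard such a fix amounts to: compare the raw field against the configured {{nullValue}} token before attempting to parse. Names here ({{castToDate}}, {{CsvDateCast}}) are illustrative only, not the actual Spark internals.

```java
import java.sql.Date;
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class CsvDateCast {
    // Illustrative guard: treat the configured null token as SQL NULL
    // instead of passing "" to DateFormat.parse, which throws
    // java.text.ParseException: Unparseable date: ""
    static Date castToDate(String datum, String nullValue, SimpleDateFormat fmt) {
        if (datum.equals(nullValue)) {
            return null; // null token -> SQL NULL, no parse attempted
        }
        try {
            return new Date(fmt.parse(datum).getTime());
        } catch (ParseException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(castToDate("1998-01-01", "", fmt)); // 1998-01-01
        System.out.println(castToDate("", "", fmt));           // null
    }
}
```

With this ordering, the reporter's row {{1|1998-01-01||}} yields {{NA}} for the empty {{dtwo}} column, matching the Spark 1.6.2 behavior shown above.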