Thomas Diesler created SPARK-29068:
--------------------------------------

             Summary: CSV read reports incorrect row count
                 Key: SPARK-29068
                 URL: https://issues.apache.org/jira/browse/SPARK-29068
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: Thomas Diesler
Reading the [SFNY example data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv] in Java like this ...

{code:java}
Path srcdir = Paths.get("src/test/resources");
Path inpath = srcdir.resolve("part_1_data.csv");

SparkSession session = getOrCreateSession();
Dataset<Row> dataset = session.read()
        //.option("header", true)
        .option("mode", "DROPMALFORMED")
        .schema(new StructType()
                .add("insf", IntegerType, false)
                .add("beds", DoubleType, false)
                .add("baths", DoubleType, false)
                .add("price", IntegerType, false)
                .add("year", IntegerType, false)
                .add("sqft", IntegerType, false)
                .add("prcsqft", IntegerType, false)
                .add("elevation", IntegerType, false))
        .csv(inpath.toString());
{code}

incorrectly reports 495 rows instead of 492. The count seems to include the three header rows. Without DROPMALFORMED, the read also produces 495 rows, three of which contain only null values. That too seems incorrect, because the schema explicitly declares every field as non-nullable.

This code works fine with Spark 2.1.0.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
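The expected behavior can be sketched without Spark: since the schema declares every field as a non-nullable numeric type, DROPMALFORMED should discard any line that does not parse into eight numbers, header rows included. A minimal, self-contained Java sketch of that rule (class name and sample lines are illustrative, not taken from the actual file or from Spark's implementation):

{code:java}
import java.util.List;

public class DropMalformedSketch {

    // Returns true if the line parses into the 8-column numeric schema
    // (int, double, double, int, int, int, int, int) with no null/empty fields.
    static boolean isWellFormed(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 8) return false;
        try {
            Integer.parseInt(f[0].trim());
            Double.parseDouble(f[1].trim());
            Double.parseDouble(f[2].trim());
            for (int i = 3; i < 8; i++) Integer.parseInt(f[i].trim());
            return true;
        } catch (NumberFormatException e) {
            return false; // malformed under this schema -> dropped
        }
    }

    static long countValid(List<String> lines) {
        return lines.stream().filter(DropMalformedSketch::isWellFormed).count();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Header line one,,,,,,,",                // malformed: descriptive header text
            "Header line two,,,,,,,",                // malformed
            "in_sf,beds,bath,price,year_built,sqft,price_per_sqft,elevation", // column names
            "0,2.0,1.0,999000,1960,1000,999,10",     // valid data row
            "1,3.0,2.0,1250000,1997,1300,961,243"    // valid data row
        );
        // Only the two data rows survive; the three header rows are dropped.
        System.out.println(countValid(lines)); // prints 2
    }
}
{code}

By that rule the reported count for the file should be 492 (495 lines minus the 3 header rows), which is what Spark 2.1.0 returns.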