Thomas Diesler created SPARK-29068:
--------------------------------------

             Summary: CSV read reports incorrect row count
                 Key: SPARK-29068
                 URL: https://issues.apache.org/jira/browse/SPARK-29068
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: Thomas Diesler


Reading the [SFNY example data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv] in Java like this:
{code:java}
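import static org.apache.spark.sql.types.DataTypes.DoubleType;
import static org.apache.spark.sql.types.DataTypes.IntegerType;

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

        // Locate the CSV file in the test resources
        Path srcdir = Paths.get("src/test/resources");
        Path inpath = srcdir.resolve("part_1_data.csv");

        // getOrCreateSession() is the reporter's own helper (not shown)
        SparkSession session = getOrCreateSession();
        Dataset<Row> dataset = session.read()
                        // header handling is intentionally left disabled
                        //.option("header", true)
                        // rows that do not fit the schema should be dropped
                        .option("mode", "DROPMALFORMED")
                        // explicit schema; every field is declared non-nullable
                        .schema(new StructType()
                                .add("insf", IntegerType, false)
                                .add("beds", DoubleType, false)
                                .add("baths", DoubleType, false)
                                .add("price", IntegerType, false)
                                .add("year", IntegerType, false)
                                .add("sqft", IntegerType, false)
                                .add("prcsqft", IntegerType, false)
                                .add("elevation", IntegerType, false))
                        .csv(inpath.toString());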
{code}
The resulting dataset incorrectly reports 495 rows instead of 492. It seems to 
include the three header rows in the count.
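
The expected count can be cross-checked without Spark. The following is only a minimal sketch in plain Java, not part of the original reproduction: the class name CsvRowCheck, its methods, and the sample lines are hypothetical, and it assumes a simple comma-separated file without quoted fields. It counts the lines whose eight fields parse as the types declared in the schema above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: counts lines that fit the eight-column numeric schema.
class CsvRowCheck {

    // True if the line has exactly 8 comma-separated fields that parse as
    // int, double, double, int, int, int, int, int (the order used in the schema).
    static boolean matchesSchema(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 8) {
            return false;
        }
        try {
            Integer.parseInt(fields[0].trim());
            Double.parseDouble(fields[1].trim());
            Double.parseDouble(fields[2].trim());
            for (int i = 3; i < 8; i++) {
                Integer.parseInt(fields[i].trim());
            }
            return true;
        } catch (NumberFormatException e) {
            // Header or otherwise non-numeric line
            return false;
        }
    }

    // Number of lines that would survive a strict "drop malformed" pass.
    static long countSchemaRows(List<String> lines) {
        return lines.stream().filter(CsvRowCheck::matchesSchema).count();
    }

    public static void main(String[] args) {
        // Made-up sample lines, NOT the contents of part_1_data.csv.
        List<String> sample = Arrays.asList(
                "some header text",
                "in_sf,beds,bath,price,year_built,sqft,price_per_sqft,elevation",
                "0,2.0,1.0,999000,1960,1000,999,10",
                "1,1.0,1.0,500000,2000,800,625,5");
        System.out.println(countSchemaRows(sample)); // prints 2
    }
}
```

To apply this to the real file, the lines could be loaded with java.nio.file.Files.readAllLines(inpath) and passed to countSchemaRows; the result could then be compared with the count Spark reports.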

Also, without DROPMALFORMED, it creates 495 rows, three of which are null-value 
rows. This also seems to be incorrect because the schema explicitly requires 
non-null values for all fields.

This code works fine with Spark 2.1.0.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

