I have a simple people.csv and the following SimpleApp:

people.csv
----------
name,age
abc,22
xyz,32

********************************
Working Code
********************************
object SimpleApp {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkFactory.getSparkSession("PIPE2Dataset")
    import spark.implicits._

    // Read the CSV with its header row and map each row onto Person by column name
    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .csv("/people.csv")
      .as[Person]
  }
}
********************************


********************************
Fails for data with no header
********************************
Removing the header record "name,age" and switching the header option off
(.option("header", "false")) returns the error: *cannot resolve '`name`' given
input columns: [_c0, _c1]*

val peopleDS = spark.read.option("inferSchema","true").option("header",
"false").option("delimiter", ",").csv("/people.csv").as[Person]

Shouldn't this just assign the column names from the Person class?
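
For comparison, here is a minimal sketch of one way to make the headerless read resolve, assuming that .as[Person] matches columns by name and so does not rename _c0 and _c1 by itself. The columns are renamed positionally with toDF before the conversion. The local SparkSession is an assumption standing in for SparkFactory.getSparkSession, and the object name HeaderlessNames is only illustrative.

```scala
import org.apache.spark.sql.SparkSession

object HeaderlessNames {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session replaces the SparkFactory helper from the post.
    val spark = SparkSession.builder()
      .appName("PIPE2Dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "false")
      .option("delimiter", ",")
      .csv("/people.csv")
      .toDF("name", "age") // rename _c0, _c1 by position to match Person's fields
      .as[Person]

    peopleDS.show()
    spark.stop()
  }
}
```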


********************************
invalid data
********************************
As I've specified *.as[Person]*, which does schema inference,
*option("inferSchema","true")* should be redundant and not needed!
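
A sketch of what dropping inferSchema could look like, assuming the schema has to be handed to the reader explicitly rather than being picked up from .as[Person]. Here it is derived from the case class with Encoders.product, so no inference pass over the file is requested. The local SparkSession and the object name SchemaFromCaseClass are assumptions for illustration only.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

object SchemaFromCaseClass {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session replaces the SparkFactory helper from the post.
    val spark = SparkSession.builder()
      .appName("PIPE2Dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a StructType (name: string, age: long) from the case class definition.
    val personSchema = Encoders.product[Person].schema

    val peopleDS = spark.read
      .schema(personSchema)      // explicit schema, so inferSchema is not set
      .option("header", "false") // headerless file: schema fields apply by position
      .option("delimiter", ",")
      .csv("/people.csv")
      .as[Person]

    peopleDS.printSchema()
    spark.stop()
  }
}
```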


And lastly, does .as[Person] check that each column value matches its data
type, i.e. would "age: Long" fail if it gets a non-numeric value? I ask because
the input file could have millions of rows, so such a check could be very
time consuming.
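
To make the question concrete, here is a sketch of how a strict check could be requested, assuming the CSV reader's "mode" option is the relevant switch: with FAILFAST, a value that cannot be parsed as the declared type should raise an error when an action forces the rows to be read, while the default PERMISSIVE mode is expected to produce nulls instead. The local SparkSession and the object name StrictRead are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import scala.util.{Failure, Success, Try}

object StrictRead {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session replaces the SparkFactory helper from the post.
    val spark = SparkSession.builder()
      .appName("PIPE2Dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", LongType)
    ))

    val peopleDS = spark.read
      .schema(schema)
      .option("header", "false")
      .option("delimiter", ",")
      .option("mode", "FAILFAST") // fail on malformed records instead of nulling them
      .csv("/people.csv")
      .as[Person]

    // Reading is lazy: nothing is parsed until an action such as collect() runs.
    Try(peopleDS.collect()) match {
      case Success(rows) => println(s"Parsed ${rows.length} rows")
      case Failure(e)    => println(s"Read failed: ${e.getMessage}")
    }
    spark.stop()
  }
}
```

Since the parse happens as part of the read itself, the check would not be a separate pass over the file; on millions of rows, collect() should of course be swapped for an action that does not pull everything to the driver.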
