Re: Dataset Type safety

2017-01-10 Thread Michael Armbrust
>
> As I've specified *.as[Person]*, which does schema inference,
> *option("inferSchema","true")* is redundant and not needed!


For case classes, fields are resolved by name, not by position.  This is
what allows us to support more complex things like JSON or nested
structures.  If you just want to map by position, you can use
.as[(String, Long)] to map to a tuple instead.
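
As a rough sketch of the two options for a headerless file (this example is not from the original thread): the object name, the temporary file, and the local SparkSession are assumptions made for illustration. It maps the columns by position to a tuple, and alternatively renames them with toDF so that name-based resolution against Person can succeed.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical example object; Person mirrors the case class from the thread.
object HeaderlessCsvExample {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here instead of the poster's SparkFactory helper.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("HeaderlessCsvExample")
      .getOrCreate()
    import spark.implicits._

    // Write the sample rows from the thread, without the header record.
    val path = Files.createTempFile("people", ".csv")
    Files.write(path, "abc,22\nxyz,32\n".getBytes("UTF-8"))

    // With header=false the columns are named _c0 and _c1.
    val raw = spark.read
      .option("inferSchema", "true")
      .option("header", "false")
      .csv(path.toString)

    // Option 1: map by position to a tuple.
    val byPosition: Dataset[(String, Long)] = raw.as[(String, Long)]

    // Option 2: rename the columns so they resolve by name against Person.
    val byName: Dataset[Person] = raw.toDF("name", "age").as[Person]

    byPosition.show()
    byName.show()
    spark.stop()
  }
}
```

The toDF variant keeps the Person type downstream, at the cost of stating the column order explicitly.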

> And lastly does .as[Person] check that column value matches with data type
> i.e. "age Long" would fail if it gets a non numeric value! because the
> input file could be millions of row which could be very time consuming.


No, this is a static check based on the schema.  It does not scan the data
(though schema inference does).
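
One way to avoid the inference pass over the data, sketched here under assumptions (the object name, the local SparkSession, and the optional path argument are illustrative and not from the original thread), is to derive the schema from the case class and hand it to the reader. The .as[Person] call then only checks that schema against the case class. How a non-numeric age value is treated should be governed by the CSV reader's parsing mode once an action actually reads the rows, not by .as[Person].

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical example object; Person mirrors the case class from the thread.
object ExplicitSchemaExample {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here instead of the poster's SparkFactory helper.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ExplicitSchemaExample")
      .getOrCreate()
    import spark.implicits._

    // Path from the thread by default; can be overridden with the first argument.
    val path = args.headOption.getOrElse("/people.csv")

    // Derive the schema (name: string, age: long) from the case class,
    // so inferSchema and its scan of the data are not needed.
    val schema = Encoders.product[Person].schema

    val peopleDS = spark.read
      .schema(schema)
      .option("header", "true")
      .csv(path)
      .as[Person] // checked against the schema only; no rows are read here

    peopleDS.printSchema()
    spark.stop()
  }
}
```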


Dataset Type safety

2017-01-10 Thread A Shaikh
I have a simple people.csv and the following SimpleApp:


people.csv
--
name,age
abc,22
xyz,32


Working Code

object SimpleApp {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkFactory.getSparkSession("PIPE2Dataset")
    import spark.implicits._

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .csv("/people.csv")
      .as[Person]
  }
}




Fails for data with no header

Removing the header record "name,age" and switching the header option off
(.option("header", "false")) returns the error *cannot resolve '`name`' given
input columns: [_c0, _c1]*:

val peopleDS = spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("delimiter", ",")
  .csv("/people.csv")
  .as[Person]

Shouldn't this just assign the column names from the Person class?



invalid data

As I've specified *.as[Person]*, which does schema inference,
*option("inferSchema","true")* is redundant and not needed!


And lastly, does .as[Person] check that each column value matches its data
type, i.e. would "age: Long" fail if it gets a non-numeric value? The input
file could have millions of rows, so such a check could be very time consuming.