I see — using a case class, I can control the data types strictly.
scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> rdd.toDF.printSchema
root
|-- _1: string (nullable = true)
|-- _2: integer (nullable = false)
And I can change the second column to another type, such as Double, with a case class:
scala> rdd.map{ case (x,y) => Fruit(x,y) }.toDF.printSchema
root
|-- fruit: string (nullable = true)
|-- num: double (nullable = false)
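For reference, the case class used above was not shown in the thread; a definition matching the printed schema would look something like this (the field names fruit and num are inferred from the schema output). Note that the Int values in the source tuples widen to Double automatically:

```scala
// Hypothetical case class inferred from the printed schema:
// fruit: string (nullable = true), num: double (nullable = false)
case class Fruit(fruit: String, num: Double)

// The Int literal 1 widens to the Double 1.0 when the
// constructor is applied, e.g. in rdd.map { case (x, y) => Fruit(x, y) }
val f = Fruit("apple", 1)
```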
Thank you.
On 2022/2/8 10:32, Sean Owen wrote:
It's just a possibly tidier way to represent objects with named, typed
fields, in order to specify a DataFrame's contents.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org