I see that using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)


and I can change the second column to another type, such as Double, with a case class:

scala> rdd.map{ case (x,y) => Fruit(x,y) }.toDF.printSchema
root
 |-- fruit: string (nullable = true)
 |-- num: double (nullable = false)
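For reference, the Fruit case class used above is not shown in the transcript; a definition inferred from the schema output would look like this (the class and field names are assumptions based on the printed schema):

```scala
// Hypothetical definition matching the schema above.
// Declaring `num` as Double is what makes the second
// column appear as `double` in the DataFrame schema;
// the Int from the original tuple is implicitly widened.
case class Fruit(fruit: String, num: Double)
```

Note that `Fruit(x, y)` compiles even though `y` is an Int, because Scala widens Int to Double at the call site.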



Thank you.



On 2022/2/8 10:32, Sean Owen wrote:
It's just a possibly tidier way to represent objects with named, typed fields, in order to specify a DataFrame's contents.
