I see that using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)


and I can change the second column to another type, such as Double, with a case class:

scala> rdd.map{ case (x,y) => Fruit(x,y) }.toDF.printSchema
root
 |-- fruit: string (nullable = true)
 |-- num: double (nullable = false)
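For reference, the Fruit case class used above is not shown in the transcript; a definition inferred from the schema output would look like this (the class and field names are assumptions based on the printed schema):

```scala
// Hypothetical definition matching the schema above.
// Declaring `num` as Double is what makes the second
// column appear as `double` in the DataFrame schema;
// the Int from the original tuple is implicitly widened.
case class Fruit(fruit: String, num: Double)
```

Note that `Fruit(x, y)` compiles even though `y` is an Int, because Scala widens Int to Double at the call site.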



Thank you.



On 2022/2/8 10:32, Sean Owen wrote:
It's just a possibly tidier way to represent objects with named, typed fields, in order to specify a DataFrame's contents.
