Re: question on the different way of RDD to dataframe

Mich Talebzadeh Tue, 08 Feb 2022 08:47:19 -0800

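As Sean mentioned, a Scala case class is a handy way of representing objects
with names and types. For example, suppose you are reading a CSV file whose
column names contain spaces, like "counter party", and you want more
compact column names such as counterparty. You can read the file without
using its header, so the columns arrive as _c0, _c1 and so on, and then map
them onto a case class that supplies the names and types: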



scala> val location="hdfs://rhes75:9000/tmp/crap.csv"

location: String = hdfs://rhes75:9000/tmp/crap.csv

scala> val df1 = spark.read.option("header", false).csv(location)  // header=false: no line is used for column names

df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 34 more fields]  // columns get the default names _c0, _c1, etc.

scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
PRICE: Double)  // create name and type for _c0, _c1 and so forth

defined class columns

scala> val df2 = df1.map(p => columns(p(0).toString,p(1).toString,
p(2).toString,p(3).toString.toDouble)) // map those columns

df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string
... 2 more fields]

scala> df2.printSchema

root

 |-- KEY: string (nullable = true)

 |-- TICKER: string (nullable = true)

 |-- TIMEISSUED: string (nullable = true)

 |-- PRICE: double (nullable = false)
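
A possible variation on the above, offered as a sketch rather than as what was
actually run: instead of pulling values out of each Row by position, you can
rename and cast the columns with select and then attach the case class with
as[...]. The names Trade, CsvToTypedDataset and toTrades, the local master
setting, and the sample row are all made up for illustration; only the column
names and types come from the session above.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical name for illustration; the session above calls this class `columns`.
case class Trade(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)

object CsvToTypedDataset {

  // Rename the positional columns, cast PRICE to double, then attach the
  // case class as the row type with as[Trade].
  def toTrades(spark: SparkSession, raw: DataFrame): Dataset[Trade] = {
    import spark.implicits._ // brings the Encoder[Trade] into scope
    raw.select(
      col("_c0").as("KEY"),
      col("_c1").as("TICKER"),
      col("_c2").as("TIMEISSUED"),
      col("_c3").cast("double").as("PRICE")
    ).as[Trade]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, just for this sketch
      .appName("csv-to-typed-dataset")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for the first four columns of a headerless CSV read.
    val raw = Seq(("k1", "IBM", "2022-02-08 08:47:19", "135.5"))
      .toDF("_c0", "_c1", "_c2", "_c3")

    val trades = toTrades(spark, raw)
    trades.printSchema()

    // Fields are checked by the compiler: t.PRICE is a Double.
    trades.filter((t: Trade) => t.PRICE > 100.0).show()

    spark.stop()
  }
}
```

Either way the case class is what gives the Dataset its field names and
types, so a misspelled field or a wrong type in the filter is caught at
compile time rather than when the job runs.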

HTH


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 8 Feb 2022 at 14:32, Sean Owen <sro...@gmail.com> wrote:

> It's just a possibly tidier way to represent objects with named, typed
> fields, in order to specify a DataFrame's contents.
>
> On Tue, Feb 8, 2022 at 4:16 AM <capitnfrak...@free.fr> wrote:
>
>> Hello
>>
>> I am converting some py code to scala.
>> This works in python:
>>
>> >>> rdd = sc.parallelize([('apple',1),('orange',2)])
>> >>> rdd.toDF(['fruit','num']).show()
>> +------+---+
>> | fruit|num|
>> +------+---+
>> | apple|  1|
>> |orange|  2|
>> +------+---+
>>
>> And in scala:
>> scala> rdd.toDF("fruit","num").show()
>> +------+---+
>> | fruit|num|
>> +------+---+
>> | apple|  1|
>> |orange|  2|
>> +------+---+
>>
>> But I have seen a lot of code that uses a case class for the conversion.
>>
>> scala> case class Fruit(fruit:String,num:Int)
>> defined class Fruit
>>
>> scala> rdd.map{case (x,y) => Fruit(x,y) }.toDF().show()
>> +------+---+
>> | fruit|num|
>> +------+---+
>> | apple|  1|
>> |orange|  2|
>> +------+---+
>>
>>
>> Do you know why one would use a "case class" here?
>>
>> thanks.
>>
>>
