Re: reading csv file, operation on column or columns

Mich Talebzadeh Sun, 20 Mar 2016 16:07:34 -0700

Apologies, good point. convertColumn() is my own helper method, not part of Spark:

def convertColumn(df: org.apache.spark.sql.DataFrame, name: String, newType: String) = {
  // Move the original column aside under a temporary name
  val df_1 = df.withColumnRenamed(name, "ConvertColumn")
  // Re-create it under the original name with the new type, then drop the temporary column
  df_1.withColumn(name, df_1.col("ConvertColumn").cast(newType)).drop("ConvertColumn")
}

scala> val df_3 = convertColumn(df_2, "InvoiceNumber", "Integer")
df_3: org.apache.spark.sql.DataFrame = [Payment date: string, Net: string, VAT: string, InvoiceNumber: int]
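
For completeness, here is a minimal sketch of the whole pipeline from the original question, including the Parquet write that the thread does not show. It assumes Spark 1.4 or later, where withColumn replaces an existing column of the same name, so the cast needs no temporary column and the column keeps its position in the schema. castColumn, cleanAndSave and outputPath are hypothetical names used only for illustration. "Payment date" is renamed as well because, as far as I know, the Parquet writer rejects column names that contain spaces.

```scala
import org.apache.spark.sql.DataFrame

// Cast a column in place. Assumes Spark 1.4+, where withColumn replaces an
// existing column of the same name instead of appending a duplicate.
def castColumn(df: DataFrame, name: String, newType: String): DataFrame =
  df.withColumn(name, df.col(name).cast(newType))

// Hypothetical helper: rename, drop, cast, then save as Parquet.
// Column names are the ones from the CSV file discussed in this thread.
def cleanAndSave(df: DataFrame, outputPath: String): Unit = {
  val renamed = df
    .withColumnRenamed("Invoice Number", "InvoiceNumber")
    .withColumnRenamed("Payment date", "PaymentDate") // no spaces for Parquet
    .drop("Total")
  val cleaned = castColumn(renamed, "InvoiceNumber", "int")
  cleaned.write.parquet(outputPath)
}
```

Usage would be cleanAndSave(df, outputPath), with df loaded from the CSV file as in the quoted message below and outputPath a directory of your choice.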


HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 March 2016 at 22:48, Ted Yu wrote:

> Mich:
> Looks like convertColumn() is a method of your own - I don't see it in the
> Spark code base.
>
> On Sun, Mar 20, 2016 at 3:38 PM, Mich Talebzadeh wrote:
>
>> Pretty straightforward, as pointed out by Ted.
>>
>> --
>> --read the CSV file into a DataFrame
>> --
>> scala> val df = sqlContext.read.format("com.databricks.spark.csv").
>>      |   option("inferSchema", "true").
>>      |   option("header", "true").
>>      |   load("/data/stg/table2")
>>
>> scala> df.printSchema
>> root
>>  |-- Invoice Number: string (nullable = true)
>>  |-- Payment date: string (nullable = true)
>>  |-- Net: string (nullable = true)
>>  |-- VAT: string (nullable = true)
>>  |-- Total: string (nullable = true)
>> --
>> --rename the first column as InvoiceNumber getting rid of space
>> --
>> scala> val df_1 = df.withColumnRenamed("Invoice Number","InvoiceNumber")
>> df_1: org.apache.spark.sql.DataFrame = [InvoiceNumber: string, Payment date: string, Net: string, VAT: string, Total: string]
>> --
>> --drop column Total
>> --
>> scala> val df_2 = df_1.drop("Total")
>> df_2: org.apache.spark.sql.DataFrame = [InvoiceNumber: string, Payment date: string, Net: string, VAT: string]
>> --
>> -- Change InvoiceNumber from String to Integer
>> --
>> scala> val df_3 = convertColumn(df_2, "InvoiceNumber","Integer")
>> df_3: org.apache.spark.sql.DataFrame = [Payment date: string, Net: string, VAT: string, InvoiceNumber: int]
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 20 March 2016 at 22:15, Ted Yu wrote:
>>
>>> Please refer to the following methods of DataFrame:
>>>
>>>   def withColumn(colName: String, col: Column): DataFrame
>>>
>>>   def drop(colName: String): DataFrame
>>>
>>> On Sun, Mar 20, 2016 at 2:47 PM, Ashok Kumar wrote:
>>>
>>>> Gurus,
>>>>
>>>> I would like to read a CSV file into a DataFrame and be able to rename
>>>> a column, change a column's type from String to Integer, or drop a
>>>> column from further analysis before saving the data as a Parquet file.
>>>> How can I do this?
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>
