Hi Mich,

You can use data frames (
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
to achieve that.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

val rdd = sc.textFile("/data/stg/table2")

// ...
// perform your business logic, cleanups, etc., producing resultRdd
// ...

sqlContext.createDataFrame(resultRdd).write.orc("..path..")

Please note that resultRdd should contain Products (e.g. instances of a case class).

Cheers, Alex.



On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
> We put CSV files that are compressed with bzip2 into a staging area on HDFS.
>
> In Hive an external table is created as below:
>
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
>  INVOICENUMBER string
> ,PAYMENTDATE string
> ,NET string
> ,VAT string
> ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/table2'
> TBLPROPERTIES ("skip.header.line.count"="1");
>
> We have an ORC table in Hive created as below:
>
>
>
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
>  INVOICENUMBER          INT
> ,PAYMENTDATE            timestamp
> ,NET                    DECIMAL(20,2)
> ,VAT                    DECIMAL(20,2)
> ,TOTAL                  DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> ;
>
> Then we insert the data from the external table into the target table, doing
> some conversions and ignoring empty rows:
>
> INSERT INTO TABLE t2
> SELECT
>           INVOICENUMBER
>         , CAST(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy')*1000 as timestamp)
>         --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
> FROM stg_t2;
>
> This works OK for now.
>
>
>
> I was wondering whether this could be done using RDD operations in Spark?
>
> val rdd = sc.textFile("/data/stg/table2")
>
> I can use rdd.count to see the total number of rows and rdd.collect.foreach(println)
> to see the individual rows.
>
>
>
> I would like to get some ideas on how I can do the CAST conversions etc. on
> the data to clean it up and store it in the said ORC table.
>
>
>
> Thanks
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
