Hi Mich,

You can use data frames
(http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
to achieve that.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

var rdd = sc.textFile("/data/stg/table2")
// ...
// perform your business logic, cleanups, etc.
// ...

sqlContext.createDataFrame(resultRdd).write.orc("..path..")

Please note that resultRdd should contain Products (e.g. case classes).

Cheers,
Alex.

On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
> We put csv files that are zipped using bzip2 into a staging area on hdfs.
>
> In Hive an external table is created as below:
>
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
>    INVOICENUMBER string
>   ,PAYMENTDATE string
>   ,NET string
>   ,VAT string
>   ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/table2'
> TBLPROPERTIES ("skip.header.line.count"="1")
>
> We have an ORC table in Hive created as below:
>
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
>    INVOICENUMBER INT
>   ,PAYMENTDATE timestamp
>   ,NET DECIMAL(20,2)
>   ,VAT DECIMAL(20,2)
>   ,TOTAL DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> ;
>
> Then we insert the data from the external table into the target table,
> doing some conversion and ignoring empty rows:
>
> INSERT INTO TABLE t2
> SELECT
>   INVOICENUMBER
> , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp)
> --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
> FROM
>   stg_t2
>
> This works OK for now.
>
> I was wondering whether this could be done using operations on RDDs in
> Spark?
>
> var rdd = sc.textFile("/data/stg/table2")
>
> I can use rdd.count to see the total rows and
> rdd.collect.foreach(println) to see the individual rows.
>
> I would like to get some ideas on how I can do the CAST conversions etc.
> on the data to clean it up and store it in the said ORC table?
>
> Thanks
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only; if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Cloud Technology
> Partners Ltd, its subsidiaries or their employees, unless expressly so
> stated. It is the responsibility of the recipient to ensure that this
> email is virus free, therefore neither Cloud Technology Partners Ltd, its
> subsidiaries nor their employees accept any responsibility.
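
For completeness, here is a minimal end-to-end sketch of the approach Alex
describes, written for spark-shell against Spark 1.x with a HiveContext. The
Invoice case class and the toDecimal/toTs helpers are illustrative names, not
anything from the thread; the columns follow Mich's stg_t2/t2 definitions
above, and the naive split(",") would need a proper CSV parser if any fields
contain embedded commas or quotes.

import org.apache.spark.sql.hive.HiveContext

// A "Product" mirroring the t2 schema, so createDataFrame can infer it.
case class Invoice(invoicenumber: Int,
                   paymentdate: java.sql.Timestamp,
                   net: BigDecimal,
                   vat: BigDecimal,
                   total: BigDecimal)

// Keep only digits and the decimal point, mirroring the REGEXP_REPLACE
// calls in the Hive INSERT, then convert to BigDecimal.
def toDecimal(s: String): BigDecimal = BigDecimal(s.replaceAll("[^\\d.]", ""))

// Parse dd/MM/yyyy payment dates into a Timestamp.
def toTs(s: String): java.sql.Timestamp = {
  val fmt = new java.text.SimpleDateFormat("dd/MM/yyyy")
  new java.sql.Timestamp(fmt.parse(s).getTime)
}

val sqlContext = new HiveContext(sc)

val raw = sc.textFile("/data/stg/table2")
val header = raw.first()

val resultRdd = raw
  .filter(_ != header)                                  // skip the header row
  .map(_.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")))
  .filter(f => f.length >= 5 && f(0).nonEmpty)          // ignore empty rows
  .map(f => Invoice(f(0).toInt, toTs(f(1)),
                    toDecimal(f(2)), toDecimal(f(3)), toDecimal(f(4))))

// Write ORC files directly ...
sqlContext.createDataFrame(resultRdd).write.orc("..path..")

// ... or append into the existing Hive ORC table instead:
// sqlContext.createDataFrame(resultRdd).write.mode("append").insertInto("t2")

There is no error handling here, so a malformed row (for example an empty
amount field) will fail the job; in practice you would wrap the parsing in a
Try or filter bad records out first.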