Hi,
We put csv files that are zipped using bzip into a staging are on hdfs In Hive an external table is created as below: DROP TABLE IF EXISTS stg_t2; CREATE EXTERNAL TABLE stg_t2 ( INVOICENUMBER string ,PAYMENTDATE string ,NET string ,VAT string ,TOTAL string ) COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology' ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/data/stg/table2' TBLPROPERTIES ("skip.header.line.count"="1") We have an ORC table in Hive created as below: DROP TABLE IF EXISTS t2; CREATE TABLE t2 ( INVOICENUMBER INT ,PAYMENTDATE timestamp ,NET DECIMAL(20,2) ,VAT DECIMAL(20,2) ,TOTAL DECIMAL(20,2) ) COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology' STORED AS ORC TBLPROPERTIES ( "orc.compress"="ZLIB" ) ; Then we insert the data from the external table into target table do some conversion and ignoring empty rows INSERT INTO TABLE t2 SELECT INVOICENUMBER , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp) --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2)) , CAST(REGEXP_REPLACE(net,'[^\d\.]','') AS DECIMAL(20,2)) , CAST(REGEXP_REPLACE(vat,'[^\d\.]','') AS DECIMAL(20,2)) , CAST(REGEXP_REPLACE(total,'[^\d\.]','') AS DECIMAL(20,2)) FROM stg_t2 This works OK for now. I was wondering whether this could be done using operations on rdd in Spark? var rdd = sc.textFile("/data/stg/table2") I can use rdd.count to see the total rows and rdd.collect.foreach(println) to see the individual rows I would like to get some ideas on how I can do CAST conversion etc on the data to clean it up and store it in the said ORC table? Thanks -- Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Cloud Technology partners Ltd, its subsidiaries nor their employees accept any responsibility.