Hi, 

We put csv files that are zipped using bzip into a staging are on hdfs 

In Hive an external table is created as below: 

DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/table2'
TBLPROPERTIES ("skip.header.line.count"="1") 

We have an ORC table in Hive created as below: 

DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER INT
,PAYMENTDATE timestamp
,NET DECIMAL(20,2)
,VAT DECIMAL(20,2)
,TOTAL DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
; 

Then we insert the data from the external table into target table do
some conversion and ignoring empty rows 

INSERT INTO TABLE t2
SELECT
 INVOICENUMBER
 , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp)
 --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
 , CAST(REGEXP_REPLACE(net,'[^\d\.]','') AS DECIMAL(20,2))
 , CAST(REGEXP_REPLACE(vat,'[^\d\.]','') AS DECIMAL(20,2))
 , CAST(REGEXP_REPLACE(total,'[^\d\.]','') AS DECIMAL(20,2))
FROM
stg_t2 

This works OK for now. 

I was wondering whether this could be done using operations on rdd in
Spark? 

var rdd = sc.textFile("/data/stg/table2") 

I can use rdd.count to see the total rows and
rdd.collect.foreach(println) to see the individual rows 

I would like to get some ideas on how I can do CAST conversion etc on
the data to clean it up and store it in the said ORC table? 

Thanks 

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 

Reply via email to