Re: Importing csv files into Hive ORC target table

Alex Dzhagriev Thu, 18 Feb 2016 13:00:51 -0800

Hi Mich,

Try to use a regexp to parse your string instead of the split.


Thanks, Alex.

On Thu, Feb 18, 2016 at 6:35 PM, Mich Talebzadeh <
[email protected]> wrote:

>
>
> thanks,
>
>
>
> I have an issue here.
>
> define rdd to read the CSV file
>
> scala> var csv = sc.textFile("/data/stg/table2")
> csv: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at textFile
> at <console>:27
>
> I then get rid of the header
>
> scala> val csv2 = csv.mapPartitionsWithIndex { (idx, iter) => if (idx ==
> 0) iter.drop(1) else iter }
> csv2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at
> mapPartitionsWithIndex at <console>:29
>
> This is what I have now
>
> scala> csv.first
>
> res79: String = Invoice Number,Payment date,Net,VAT,Total
>
> *scala> csv2.first*
>
> *res80: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"*
>
> Then I define a class based on the columns
>
> scala> case class Invoice(invoice: String, date: String, net: String, vat:
> String, total: String)
> defined class Invoice
>
> Next stage to map the data to their individual columns
>
> scala> val ttt = csv2.map(_.split(",")).map(p =>
> Invoice(p(0),p(1),p(2),p(3),p(4)))
> ttt: org.apache.spark.rdd.RDD[Invoice] = MapPartitionsRDD[74] at map at
> <console>:33
>
> the problem now I have is that one column is missing
>
> *scala> ttt.first*
> *res81: Invoice = Invoice(360,10/02/2014,"?2,500.00",?0.00)*
>
> it seems that I am missing the last column here!
>
> I suspect the cause of the problem is the "," used in "?2,500.00" which is
> a money column of "£250000" in excel.
>
> Any work around is appreciated.
>
> Thanks,
>
> Mich
>
>
>
>
>
> On 17/02/2016 22:58, Alex Dzhagriev wrote:
>
> Hi Mich,
>
> You can use data frames (
> http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
> to achieve that.
>
> val sqlContext = new HiveContext(sc)
>
> var rdd = sc.textFile("/data/stg/table2")
>
> //...
> //perform you business logic, cleanups, etc.
> //...
>
> sqlContext.createDataFrame(resultRdd).write.orc("..path..")
>
> Please, note that resultRdd should contain Products (e.g. case classes)
>
> Cheers, Alex.
>
>
>
> On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
> [email protected]> wrote:
>
>> Hi,
>>
>> We put csv files that are zipped using bzip into a staging are on hdfs
>>
>> In Hive an external table is created as below:
>>
>> DROP TABLE IF EXISTS stg_t2;
>> CREATE EXTERNAL TABLE stg_t2 (
>>  INVOICENUMBER string
>> ,PAYMENTDATE string
>> ,NET string
>> ,VAT string
>> ,TOTAL string
>> )
>> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> STORED AS TEXTFILE
>> LOCATION '/data/stg/table2'
>> TBLPROPERTIES ("skip.header.line.count"="1")
>>
>> We have an ORC table in Hive created as below:
>>
>>
>>
>> DROP TABLE IF EXISTS t2;
>> CREATE TABLE t2 (
>>  INVOICENUMBER          INT
>> ,PAYMENTDATE            timestamp
>> ,NET                    DECIMAL(20,2)
>> ,VAT                    DECIMAL(20,2)
>> ,TOTAL                  DECIMAL(20,2)
>> )
>> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
>> STORED AS ORC
>> TBLPROPERTIES ( "orc.compress"="ZLIB" )
>> ;
>>
>> Then we insert the data from the external table into target table do some
>> conversion and ignoring empty rows
>>
>> INSERT INTO TABLE t2
>> SELECT
>>           INVOICENUMBER
>>         , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp)
>>         --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
>>         , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>>         , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
>>         , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
>> FROM
>> stg_t2
>>
>> This works OK for now.
>>
>>
>>
>> I was wondering whether this could be done using operations on rdd in
>> Spark?
>>
>> var rdd = sc.textFile("/data/stg/table2")
>>
>> I can use rdd.count to see the total rows and
>> rdd.collect.foreach(println) to see the individual rows
>>
>>
>>
>> I would like to get some ideas on how I can do CAST conversion etc on the
>> data to clean it up and store it in the said ORC table?
>>
>>
>>
>> Thanks
>>
>> --
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Cloud Technology 
>> Partners Ltd, its subsidiaries or their employees, unless expressly so 
>> stated. It is the responsibility of the recipient to ensure that this email 
>> is virus free, therefore neither Cloud Technology partners Ltd, its 
>> subsidiaries nor their employees accept any responsibility.
>>
>>
>>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>
>
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>
>
>

Re: Importing csv files into Hive ORC target table

Reply via email to