Hi Mich, Try to use a regexp to parse your string instead of the split.
Thanks, Alex. On Thu, Feb 18, 2016 at 6:35 PM, Mich Talebzadeh < [email protected]> wrote: > > > thanks, > > > > I have an issue here. > > define rdd to read the CSV file > > scala> var csv = sc.textFile("/data/stg/table2") > csv: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at textFile > at <console>:27 > > I then get rid of the header > > scala> val csv2 = csv.mapPartitionsWithIndex { (idx, iter) => if (idx == > 0) iter.drop(1) else iter } > csv2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at > mapPartitionsWithIndex at <console>:29 > > This is what I have now > > scala> csv.first > > res79: String = Invoice Number,Payment date,Net,VAT,Total > > *scala> csv2.first* > > *res80: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"* > > Then I define a class based on the columns > > scala> case class Invoice(invoice: String, date: String, net: String, vat: > String, total: String) > defined class Invoice > > Next stage to map the data to their individual columns > > scala> val ttt = csv2.map(_.split(",")).map(p => > Invoice(p(0),p(1),p(2),p(3),p(4))) > ttt: org.apache.spark.rdd.RDD[Invoice] = MapPartitionsRDD[74] at map at > <console>:33 > > the problem now I have is that one column is missing > > *scala> ttt.first* > *res81: Invoice = Invoice(360,10/02/2014,"?2,500.00",?0.00)* > > it seems that I am missing the last column here! > > I suspect the cause of the problem is the "," used in "?2,500.00" which is > a money column of "£250000" in excel. > > Any work around is appreciated. > > Thanks, > > Mich > > > > > > On 17/02/2016 22:58, Alex Dzhagriev wrote: > > Hi Mich, > > You can use data frames ( > http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes) > to achieve that. > > val sqlContext = new HiveContext(sc) > > var rdd = sc.textFile("/data/stg/table2") > > //... > //perform you business logic, cleanups, etc. > //... > > sqlContext.createDataFrame(resultRdd).write.orc("..path..") > > Please, note that resultRdd should contain Products (e.g. case classes) > > Cheers, Alex. > > > > On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh < > [email protected]> wrote: > >> Hi, >> >> We put csv files that are zipped using bzip into a staging are on hdfs >> >> In Hive an external table is created as below: >> >> DROP TABLE IF EXISTS stg_t2; >> CREATE EXTERNAL TABLE stg_t2 ( >> INVOICENUMBER string >> ,PAYMENTDATE string >> ,NET string >> ,VAT string >> ,TOTAL string >> ) >> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology' >> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' >> STORED AS TEXTFILE >> LOCATION '/data/stg/table2' >> TBLPROPERTIES ("skip.header.line.count"="1") >> >> We have an ORC table in Hive created as below: >> >> >> >> DROP TABLE IF EXISTS t2; >> CREATE TABLE t2 ( >> INVOICENUMBER INT >> ,PAYMENTDATE timestamp >> ,NET DECIMAL(20,2) >> ,VAT DECIMAL(20,2) >> ,TOTAL DECIMAL(20,2) >> ) >> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology' >> STORED AS ORC >> TBLPROPERTIES ( "orc.compress"="ZLIB" ) >> ; >> >> Then we insert the data from the external table into target table do some >> conversion and ignoring empty rows >> >> INSERT INTO TABLE t2 >> SELECT >> INVOICENUMBER >> , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp) >> --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2)) >> , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2)) >> , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2)) >> , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) >> FROM >> stg_t2 >> >> This works OK for now. >> >> >> >> I was wondering whether this could be done using operations on rdd in >> Spark? >> >> var rdd = sc.textFile("/data/stg/table2") >> >> I can use rdd.count to see the total rows and >> rdd.collect.foreach(println) to see the individual rows >> >> >> >> I would like to get some ideas on how I can do CAST conversion etc on the >> data to clean it up and store it in the said ORC table? >> >> >> >> Thanks >> >> -- >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> http://talebzadehmich.wordpress.com >> >> NOTE: The information in this email is proprietary and confidential. This >> message is for the designated recipient only, if you are not the intended >> recipient, you should destroy it immediately. Any information in this >> message shall not be understood as given or endorsed by Cloud Technology >> Partners Ltd, its subsidiaries or their employees, unless expressly so >> stated. It is the responsibility of the recipient to ensure that this email >> is virus free, therefore neither Cloud Technology partners Ltd, its >> subsidiaries nor their employees accept any responsibility. >> >> >> > > > -- > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Cloud Technology Partners > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus free, > therefore neither Cloud Technology partners Ltd, its subsidiaries nor their > employees accept any responsibility. > > > > > > -- > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Cloud Technology Partners > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus free, > therefore neither Cloud Technology partners Ltd, its subsidiaries nor their > employees accept any responsibility. > > > > -- > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Cloud Technology Partners > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus free, > therefore neither Cloud Technology partners Ltd, its subsidiaries nor their > employees accept any responsibility. > > >
