My CSV:
name,checked-in,booking_cost
AC,true,1200
BK,false,0
DDC,true,1200
Here is what I have done:
val textFile = sc.textFile("/home/user/sampleCSV.txt")
val schemaString = "name,checked-in,booking_cost"

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Build a schema of nullable StringType columns from the header string
val schema = StructType(
  schemaString.split(",").map(fieldName =>
    StructField(fieldName, StringType, true)))

// substring(1) drops the first character of the name field
val rowRDD = textFile.map(_.split(",")).map(p =>
  Row(p(0).trim.substring(1), p(1).trim, p(2)))

val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.show
+----+----------+------------+
|name|checked-in|booking_cost|
+----+----------+------------+
| C| true| 1200|
| K| false| 0|
| DC| true| 1200|
+----+----------+------------+
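
One caveat: the schema above makes every column a StringType, so booking_cost comes back as a string. If you need it numeric you can cast after the fact; a minimal sketch, assuming the Spark 1.3+ DataFrame API (the "double" target type is just an example):

import org.apache.spark.sql.functions.col

// Cast the string column to a numeric type after loading
val typed = dataFrame.withColumn("booking_cost", col("booking_cost").cast("double"))
typed.printSchema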
The substring(1) call assumes every name value is prefixed with exactly one
character such as '?'. If you instead want to strip '?' wherever it appears,
use the String overload of replace (the Char overload cannot take an empty
replacement):

val rowRDD = textFile.map(_.split(",")).map(p =>
  Row(p(0).trim.replace("?", ""), p(1).trim, p(2)))
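
For your original line below, where "?2,500.00" carries an embedded comma, a plain split(",") breaks the field apart. One workaround is to split only on commas that fall outside double quotes and then strip the unwanted characters. A rough sketch; the lookahead regex is a common quote-aware split idiom and I have not run it against your full file:

val line = "360,10/02/2014,\"?2,500.00\",?0.00,\"?2,500.00\""

// Split on commas that are not inside double quotes
val fields = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")

// Drop the quotes, '?' and thousands separators from every field
val cleaned = fields.map(_.replaceAll("[\"?,]", ""))
// cleaned: Array(360, 10/02/2014, 2500.00, 0.00, 2500.00)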
On Fri, Feb 19, 2016 at 2:36 PM, Mich Talebzadeh <[email protected]>
wrote:
> Ok
>
>
>
> I have created a one-line CSV file as follows:
>
>
>
> cat testme.csv
>
> 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>
>
>
> I use the following in Spark to split it
>
>
>
> csv=sc.textFile("/data/incoming/testme.csv")
>
> csv.map(_.split(",")).first
>
> res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2,
> 500.00")
>
>
>
> That comes back with an array
>
>
>
> Now all I want is to get rid of “?” and “,” in the above. The problem is
> that the currency field “?2,500.00” contains an additional “,”, which
> messes things up
>
>
>
> replaceAll() does not work
>
>
>
> Any other alternatives?
>
>
>
> Thanks,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> From: Andrew Ehrlich [mailto:[email protected]]
> Sent: 19 February 2016 01:22
> To: Mich Talebzadeh <[email protected]>
> Cc: User <[email protected]>
> Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark
>
>
>
> Use the Scala method .split(",") to split the string into a collection of
> strings, then call .replaceAll() on the field containing the "?" to remove it.
>
>
>
> On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh <[email protected]>
> wrote:
>
> Hi,
>
> What is the equivalent of this Hive statement in Spark?
>
>
>
> select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
> +------------+----------+--+
> | _c0 | _c1 |
> +------------+----------+--+
> | ?2,500.00 | 2500.00 |
> +------------+----------+--+
>
> Basically I want to get rid of "?" and "," in the CSV file
>
>
>
> The full csv line is
>
>
>
> scala> csv2.first
> res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>
> I want to transform that string into 5 columns, using "," as the delimiter
>
> Thanks,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
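
P.S. On the subject-line question itself: since Spark 1.5, org.apache.spark.sql.functions exposes regexp_replace, the direct counterpart of the Hive UDF. A small sketch; df and the "amount" column name are placeholders for your own DataFrame:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Keep only digits and the decimal point, as in the Hive example
val cleanedDF = df.withColumn("amount", regexp_replace(col("amount"), "[^\\d.]", ""))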