Thanks, very helpful indeed.

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: Chandeep Singh [mailto:c...@chandeep.com] 
Sent: 19 February 2016 10:35
To: Mich Talebzadeh <m...@peridale.co.uk>
Cc: user @spark <user@spark.apache.org>
Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark

 

You might be better off using the CSV loader in this case. 
https://github.com/databricks/spark-csv
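
Note that spark-csv is an external package, so the shell has to be launched with it on the classpath, e.g. (the package/Scala version here is only an example; pick the one matching your build):

spark-shell --packages com.databricks:spark-csv_2.10:1.3.0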

 

Input:

[csingh ~]$ hadoop fs -cat test.csv

360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

 

and here is a quick and dirty way to resolve your issue:

 

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .load("test.csv")

—> df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: string, C3: string, C4: string]

 

df.first()

—> res0: org.apache.spark.sql.Row = [360,10/02/2014,?2,500.00,?0.00,?2,500.00]

 

val a = df.map(x => (x.getInt(0), x.getString(1),
  x.getString(2).replace("?", "").replace(",", ""),
  x.getString(3).replace("?", ""),
  x.getString(4).replace("?", "").replace(",", "")))

—> a: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = 
MapPartitionsRDD[17] at map at <console>:21

    

a.collect()

—> res1: Array[(Int, String, String, String, String)] = 
Array((360,10/02/2014,2500.00,0.00,2500.00))
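
Alternatively, and closer to the Hive REGEXP_REPLACE you asked about, Spark SQL (1.5+) has a regexp_replace column function, so the cleanup can stay in the DataFrame API. A minimal sketch against the same df (column names C0..C4 as inferred above; untested here):

import org.apache.spark.sql.functions.regexp_replace

// Drop everything that is not a digit or a dot, mirroring the Hive pattern '[^\\d\\.]'
val cleaned = df
  .withColumn("C2", regexp_replace(df("C2"), "[^\\d.]", ""))
  .withColumn("C3", regexp_replace(df("C3"), "[^\\d.]", ""))
  .withColumn("C4", regexp_replace(df("C4"), "[^\\d.]", ""))

cleaned.first()   // expected: [360,10/02/2014,2500.00,0.00,2500.00]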

 

On Feb 19, 2016, at 9:06 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

 

Ok

 

I have created a one-liner csv file as follows:

 

cat testme.csv

360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

 

I use the following in Spark to split it

 

val csv = sc.textFile("/data/incoming/testme.csv")

csv.map(_.split(",")).first

res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2, 
500.00")

 

That comes back with an array

 

Now all I want is to get rid of the “?” and “,” characters in the above. The problem is that the currency field “?2,500.00” contains an embedded “,”, which breaks the split.

 

replaceAll() does not work

 

Any other alternatives?
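
One alternative, for what it is worth: stay with plain RDD strings but split on "," only when it falls outside double quotes, via a lookahead regex, and then strip the unwanted characters per field. A rough sketch (assumes fields never contain escaped quotes):

// Split on commas followed by an even number of quotes, i.e. commas outside quoted fields
val line = "360,10/02/2014,\"?2,500.00\",?0.00,\"?2,500.00\""
val fields = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
  .map(_.replaceAll("[\"?,]", ""))   // then drop the quotes, "?" and embedded ","

// fields: Array(360, 10/02/2014, 2500.00, 0.00, 2500.00)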

 

Thanks,

 

 

Dr Mich Talebzadeh

 

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 

 

From: Andrew Ehrlich [mailto:and...@aehrlich.com] 
Sent: 19 February 2016 01:22
To: Mich Talebzadeh <m...@peridale.co.uk>
Cc: User <user@spark.apache.org>
Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark

 

Use the Scala method .split(",") to split the string into a collection of strings, and try using .replaceAll() on the field containing the "?" to remove it.
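
Roughly, something like this (a sketch against the csv2 RDD from your message; note that a plain split(",") will still break the quoted currency fields apart, as the follow-ups show):

val cleaned = csv2.map(_.split(",").map(_.replaceAll("[?\"]", "")))
cleaned.first
// Array(360, 10/02/2014, 2, 500.00, 0.00, 2, 500.00)  <- 7 fields: the embedded commas split the currency values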

 

On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

What is the equivalent of this Hive statement in Spark?

 

select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
+------------+----------+--+
|    _c0     |   _c1    |
+------------+----------+--+
| ?2,500.00  | 2500.00  |
+------------+----------+--+

Basically I want to get rid of "?" and "," in the csv file

 

The full csv line is

 

scala> csv2.first
res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

I want to transform that string into 5 columns, using "," as the separator.

Thanks,

Dr Mich Talebzadeh

 

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 
