Thanks, very helpful indeed.


Dr Mich Talebzadeh




NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.



From: Chandeep Singh
Sent: 19 February 2016 10:35
To: Mich Talebzadeh
Cc: user @spark
Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark


You might be better off using the CSV loader in this case.



[csingh ~]$ hadoop fs -cat test.csv



And here is a quick and dirty way to resolve your issue:


val df ="com.databricks.spark.csv").option("inferSchema", 

—> df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: string, C3: 
string, C4: string]



—> res0: org.apache.spark.sql.Row = [360,10/02/2014,?2,500.00,?0.00,?2,500.00]


val a = df.map(x => (x.getInt(0), x.getString(1), x.getString(2).replace("?", 
"").replace(",", ""), x.getString(3).replace("?", ""), 
x.getString(4).replace("?", "").replace(",", "")))

—> a: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = 
MapPartitionsRDD[17] at map at <console>:21



—> res1: Array[(Int, String, String, String, String)] = 


On Feb 19, 2016, at 9:06 AM, Mich Talebzadeh wrote:




I have created a one-line CSV file as follows:


cat testme.csv



I use the following in Spark to split it



res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2, 


That comes back with an array


Now all I want is to get rid of the “?” and “,” characters in the above. The 
problem is that the currency field “?2,500.00” contains an additional “,” of 
its own, which breaks the split.


replaceAll() does not work


Any other alternatives?
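
One possible reason replaceAll() appears not to work here (an assumption, since the failing call is not shown in this thread) is that its first argument is a regular expression, and a bare "?" is a quantifier rather than a literal character. A minimal plain-Scala sketch of the difference, with illustrative value names:

```scala
import scala.util.Try

// Sample currency field from the CSV line in this thread.
val field = "?2,500.00"

// replaceAll treats its first argument as a regex; a bare "?" is a dangling
// quantifier, so this call throws java.util.regex.PatternSyntaxException.
val bareQuestionMark = Try(field.replaceAll("?", ""))

// Escaping the "?" makes it a literal character.
val escaped = field.replaceAll("\\?", "")

// A character class removes both unwanted characters in one pass.
val viaCharClass = field.replaceAll("[?,]", "")

// String.replace takes literal text, so no escaping is needed.
val viaLiteral = field.replace("?", "").replace(",", "")
```

Note also that replaceAll is a method on String, so it has to be applied to each element of the array returned by split, not to the array itself.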





Dr Mich Talebzadeh









From: Andrew Ehrlich
Sent: 19 February 2016 01:22
To: Mich Talebzadeh
Cc: User
Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark


Use the Scala method .split(",") to split the string into a collection of 
strings, then try .replaceAll() on each field containing the "?" to remove it.
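
A sketch of this suggestion in plain Scala, with one assumption added: because the currency fields are quoted and contain a comma, a bare split(",") would cut them in two, so the split below uses a lookahead regex that only matches commas outside double quotes. The names splitCsvLine and cleanAmount are illustrative, not part of any library, and the regex does not handle escaped quotes inside fields.

```scala
// Split on commas that are followed by an even number of double quotes,
// i.e. commas that are not inside a quoted field.
def splitCsvLine(line: String): Array[String] =
  line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)

// Strip the surrounding quotes, the "?" and the thousands separator.
def cleanAmount(field: String): String =
  field.replaceAll("[\"?,]", "")

// The full CSV line quoted in this thread.
val line = "360,10/02/2014,\"?2,500.00\",?0.00,\"?2,500.00\""
val fields = splitCsvLine(line)

// Columns 0 and 1 are kept as they are; columns 2 to 4 are currency amounts.
val cleaned = fields.take(2) ++ fields.drop(2).map(cleanAmount)
```

For anything beyond a one-off line, a real CSV parser is likely the safer choice, since hand-written splitting regexes tend to miss edge cases.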


On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh wrote:


What is the equivalent of this Hive statement in Spark?


select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
|    _c0     |   _c1    |
| ?2,500.00  | 2500.00  |

Basically I want to get rid of "?" and "," in the csv file
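
For a single string value, the same pattern can be passed to String.replaceAll in Scala. A minimal sketch follows, where the name stripNonNumeric is made up for illustration. At the DataFrame level, Spark 1.5 and later also provide regexp_replace in org.apache.spark.sql.functions, which is not shown here because it needs a running Spark context.

```scala
// Same character class as the Hive call: drop everything that is not a digit or a dot.
def stripNonNumeric(s: String): String =
  s.replaceAll("[^\\d.]", "")

val amountText = stripNonNumeric("?2,500.00")

// Once cleaned, the text can be parsed as a decimal number.
val amount = BigDecimal(amountText)
```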


The full csv line is


scala> csv2.first
res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

I want to transform that string into 5 columns, using "," as the delimiter.


Dr Mich Talebzadeh







