My CSV:

    name,checked-in,booking_cost
    AC,true,1200
    BK,false,0
    DDC,true,1200
I have done:

    val textFile = sc.textFile("/home/user/sampleCSV.txt")
    val schemaString = "name,checked-in,booking_cost"

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    val schema = StructType(
      schemaString.split(",").map(fieldName => StructField(fieldName, StringType, true)))

    val rowRDD = textFile.map(_.split(",")).map(p => Row(p(0).trim.substring(1), p(1).trim, p(2)))

    val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
    dataFrame.show

    +----+----------+------------+
    |name|checked-in|booking_cost|
    +----+----------+------------+
    |   C|      true|        1200|
    |   K|     false|           0|
    |  DC|      true|        1200|
    +----+----------+------------+

The substring(1) version works if every column value is prefixed with a single '?'. Otherwise you can strip the character explicitly (note: '?' must be escaped in the regex, and replaceAll takes String arguments, not Chars):

    val rowRDD = textFile.map(_.split(",")).map(p => Row(p(0).trim.replaceAll("\\?", ""), p(1).trim, p(2)))

On Fri, Feb 19, 2016 at 2:36 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Ok
>
> I have created a one-liner csv file as follows:
>
>     cat testme.csv
>     360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>
> I use the following in Spark to split it:
>
>     csv = sc.textFile("/data/incoming/testme.csv")
>     csv.map(_.split(",")).first
>     res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2, 500.00")
>
> That comes back with an array.
>
> Now all I want is to get rid of "?" and "," in the above. The problem is I
> have a currency field, "?2,500.00", which contains an additional "," of its
> own, and that messes things up.
>
> replaceAll() does not work.
>
> Any other alternatives?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential.
> This message is for the designated recipient only; if you are not the
> intended recipient, you should destroy it immediately. Any information in
> this message shall not be understood as given or endorsed by Peridale
> Technology Ltd, its subsidiaries or their employees, unless expressly so
> stated. It is the responsibility of the recipient to ensure that this email
> is virus free; therefore neither Peridale Technology Ltd, its subsidiaries
> nor their employees accept any responsibility.
>
> *From:* Andrew Ehrlich [mailto:and...@aehrlich.com]
> *Sent:* 19 February 2016 01:22
> *To:* Mich Talebzadeh <m...@peridale.co.uk>
> *Cc:* User <user@spark.apache.org>
> *Subject:* Re: Hive REGEXP_REPLACE use or equivalent in Spark
>
> Use the Scala method .split(",") to split the string into a collection of
> strings, and try using .replaceAll() on the field with the "?" to remove it.
>
> On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi,
>
> What is the equivalent of this Hive statement in Spark?
>
>     select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
>     +------------+----------+--+
>     |    _c0     |   _c1    |
>     +------------+----------+--+
>     | ?2,500.00  | 2500.00  |
>     +------------+----------+--+
>
> Basically I want to get rid of "?" and "," in the csv file.
>
> The full csv line is:
>
>     scala> csv2.first
>     res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>
> I want to transform that string into 5 columns, using "," as the split.
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
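[Editor's note] The two problems in this thread — commas embedded inside quoted currency fields, and stripping "?" and "," from those fields — can be sketched in plain Scala without any Spark-specific API. This is a sketch, not code from the thread: the quote-aware split regex, the object name, and the helper names are my own, and the `[^\d.]` replacement mirrors the Hive `REGEXP_REPLACE` shown above.

```scala
object CsvCleanup {
  // Split only on commas that sit OUTSIDE double quotes: a comma is a real
  // delimiter when it is followed by an even number of remaining quotes.
  private val delimiter = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

  // Keep only digits and the decimal point, dropping '?', ',' and quotes.
  // Equivalent to Hive's REGEXP_REPLACE(col, '[^\\d\\.]', '').
  def cleanCurrency(field: String): String =
    field.replaceAll("[^\\d.]", "")

  def splitLine(line: String): Array[String] =
    line.split(delimiter).map(_.trim)

  def main(args: Array[String]): Unit = {
    val line = "360,10/02/2014,\"?2,500.00\",?0.00,\"?2,500.00\""
    // Clean only the fields that carry the '?' currency marker,
    // leaving the id and date fields untouched.
    val cleaned = splitLine(line).map(f => if (f.contains("?")) cleanCurrency(f) else f)
    println(cleaned.mkString("|")) // 360|10/02/2014|2500.00|0.00|2500.00
  }
}
```

In Spark this same logic can be applied inside the `textFile(...).map(...)` pipeline shown earlier; splitting with a quote-aware regex instead of a bare `split(",")` is what keeps "?2,500.00" in one piece.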