Hi Mich, It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc you can strip off and get the base retailer entity ?
Best Regards, Sonal Founder, Nube Technologies <http://www.nubetech.co> Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM> Reifier at Spark Summit 2015 <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/> <http://in.linkedin.com/in/sonalgoyal> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Thanks. > > I believe there is some catalog of companies that I can get and store it > in a table and math the company name to transactiondesciption column. > > That catalog should have sectors in it. For example company XYZ is under > Grocers etc which will make search and grouping much easier. > > I believe Spark can do it, though I am generally interested on alternative > ideas. > > > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote: > >> Well, if you still want to use windows function for your logic, then you >> need to derive a new column out, like "catalog", and use it as part of >> grouping logic. >> >> >> Maybe you can use regex for deriving out this new column. The >> implementation needs to depend on your data in "transactiondescription", >> and regex gives you the most powerful way to handle your data. >> >> >> This is really not a Spark question, but how to you process your logic >> based on the data given. >> >> >> Yong >> >> >> ------------------------------ >> *From:* Mich Talebzadeh <mich.talebza...@gmail.com> >> *Sent:* Tuesday, August 2, 2016 10:00 AM >> *To:* user @spark >> *Subject:* Extracting key word from a textual column >> >> Hi, >> >> Need some ideas. >> >> *Summary:* >> >> I am working on a tool to slice and dice the amount of money I have spent >> so far (meaning the whole data sample) on a given retailer so I have a >> better idea of where I am wasting the money >> >> *Approach* >> >> Downloaded my bank statements from a given account in csv format from >> inception till end of July. Read the data and stored it in ORC table. >> >> I am interested for all bills that I paid using Debit Card ( >> transactiontype = "DEB") that comes out the account directly. >> Transactiontype is the three character code lookup that I download as well. >> >> scala> ll_18740868.printSchema >> root >> |-- transactiondate: date (nullable = true) >> |-- transactiontype: string (nullable = true) >> |-- sortcode: string (nullable = true) >> |-- accountnumber: string (nullable = true) >> |-- transactiondescription: string (nullable = true) >> |-- debitamount: double (nullable = true) >> |-- creditamount: double (nullable = true) >> |-- balance: double (nullable = true) >> >> The important fields are transactiondate, transactiontype, >> transactiondescription and debitamount >> >> So using analytics. windowing I can do all sorts of things. For example >> this one gives me the last time I spent money on retailer XYZ and the amount >> >> SELECT * >> FROM ( >> select transactiondate, transactiondescription, debitamount >> , rank() over (order by transactiondate desc) AS rank >> from accounts.ll_18740868 where transactiondescription like '%XYZ%' >> ) tmp >> where rank <= 1 >> >> And its equivalent using Windowing in FP >> >> import org.apache.spark.sql.expressions.Window >> val wSpec = >> Window.partitionBy("transactiontype").orderBy(desc("transactiondate")) >> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription", >> rank().over(wSpec).as("rank")).filter($"rank"===1).show >> >> >> +---------------+----------------------+----+ >> |transactiondate|transactiondescription|rank| >> +---------------+----------------------+----+ >> | 2015-12-15| XYZ LTD CD 4636 | 1| >> +---------------+----------------------+----+ >> >> So far so good. But if I want to find all I spent on each retailer, then >> it gets trickier as a retailer appears like below in the column >> transactiondescription: >> >> >> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5) >> +----------------------+ >> |transactiondescription| >> +----------------------+ >> | SAINSBURYS SMKT C...| >> | SACAT SAINSBURYS ...| >> | SAINSBURY'S SMKT ...| >> | SAINSBURYS S/MKT ...| >> | SACAT SAINSBURYS ...| >> +----------------------+ >> >> If I look at them I know they all belong to SAINBURYS (food retailer). I >> have done some crude grouping and it works somehow >> >> //define UDF here to handle substring >> val SubstrUDF = udf { (s: String, start: Int, end: Int) => >> s.substring(start, end) } >> var cutoff = "CD" // currently used in the statement >> val wSpec2 = >> Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription", >> cutoff)-1)) >> ll_18740868.where($"transactiontype" === "DEB" && >> ($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription", >> cutoff)-1).as("Retailer"),sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println) >> >> However, I really need to extract the "keyword" retailer name from >> transactiondescription column And I need some ideas about the best way of >> doing it. Is this possible in Spark? >> >> Thanks >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> > >