Re: Extracting key word from a textual column

Sonal Goyal Tue, 02 Aug 2016 09:57:18 -0700

Hi Mich,

It seems like an entity resolution problem - looking at different
representations of an entity - SAINSBURY in this case and matching them all
together. How dirty is your data in the description - are there stop words
like SACAT/SMKT etc you can strip off and get the base retailer entity ?


Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>
Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 2015
<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>

<http://in.linkedin.com/in/sonalgoyal>



On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks.
>
> I believe there is some catalog of companies that I can get and store it
> in a table and math the company name to transactiondesciption column.
>
> That catalog should have sectors in it. For example company XYZ is under
> Grocers etc which will make search and grouping much easier.
>
> I believe Spark can do it, though I am generally interested on alternative
> ideas.
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>
>> Well, if you still want to use windows function for your logic, then you
>> need to derive a new column out, like "catalog", and use it as part of
>> grouping logic.
>>
>>
>> Maybe you can use regex for deriving out this new column. The
>> implementation needs to depend on your data in "transactiondescription",
>> and regex gives you the most powerful way to handle your data.
>>
>>
>> This is really not a Spark question, but how to you process your logic
>> based on the data given.
>>
>>
>> Yong
>>
>>
>> ------------------------------
>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>> *Sent:* Tuesday, August 2, 2016 10:00 AM
>> *To:* user @spark
>> *Subject:* Extracting key word from a textual column
>>
>> Hi,
>>
>> Need some ideas.
>>
>> *Summary:*
>>
>> I am working on a tool to slice and dice the amount of money I have spent
>> so far (meaning the whole data sample) on a given retailer so I have a
>> better idea of where I am wasting the money
>>
>> *Approach*
>>
>> Downloaded my bank statements from a given account in csv format from
>> inception till end of July. Read the data and stored it in ORC table.
>>
>> I am interested for all bills that I paid using Debit Card (
>> transactiontype = "DEB") that comes out the account directly.
>> Transactiontype is the three character code lookup that I download as well.
>>
>> scala> ll_18740868.printSchema
>> root
>>  |-- transactiondate: date (nullable = true)
>>  |-- transactiontype: string (nullable = true)
>>  |-- sortcode: string (nullable = true)
>>  |-- accountnumber: string (nullable = true)
>>  |-- transactiondescription: string (nullable = true)
>>  |-- debitamount: double (nullable = true)
>>  |-- creditamount: double (nullable = true)
>>  |-- balance: double (nullable = true)
>>
>> The important fields are transactiondate, transactiontype,
>> transactiondescription and debitamount
>>
>> So using analytics. windowing I can do all sorts of things. For example
>> this one gives me the last time I spent money on retailer XYZ and the amount
>>
>> SELECT *
>> FROM (
>>       select transactiondate, transactiondescription, debitamount
>>       , rank() over (order by transactiondate desc) AS rank
>>       from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>      ) tmp
>> where rank <= 1
>>
>> And its equivalent using Windowing in FP
>>
>> import org.apache.spark.sql.expressions.Window
>> val wSpec =
>> Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
>> rank().over(wSpec).as("rank")).filter($"rank"===1).show
>>
>>
>> +---------------+----------------------+----+
>> |transactiondate|transactiondescription|rank|
>> +---------------+----------------------+----+
>> |     2015-12-15|  XYZ LTD CD 4636 |   1|
>> +---------------+----------------------+----+
>>
>> So far so good. But if I want to find all I spent on each retailer, then
>> it gets trickier as a retailer appears like below in the column
>> transactiondescription:
>>
>>
>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>> +----------------------+
>> |transactiondescription|
>> +----------------------+
>> |  SAINSBURYS SMKT C...|
>> |  SACAT SAINSBURYS ...|
>> |  SAINSBURY'S SMKT ...|
>> |  SAINSBURYS S/MKT ...|
>> |  SACAT SAINSBURYS ...|
>> +----------------------+
>>
>> If I look at them I know they all belong to SAINBURYS (food retailer). I
>> have done some crude grouping and it works somehow
>>
>> //define UDF here to handle substring
>> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
>> s.substring(start, end) }
>> var cutoff = "CD"  // currently used in the statement
>> val wSpec2 =
>> Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
>> cutoff)-1))
>> ll_18740868.where($"transactiontype" === "DEB" &&
>> ($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
>> cutoff)-1).as("Retailer"),sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)
>>
>> However, I really need to extract the "keyword" retailer name from
>> transactiondescription column And I need some ideas about the best way of
>> doing it. Is this possible in Spark?
>>
>> Thanks
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>

Re: Extracting key word from a textual column

Reply via email to