Hi, Need some ideas.
*Summary:* I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) on a given retailer so I have a better idea of where I am wasting the money *Approach* Downloaded my bank statements from a given account in csv format from inception till end of July. Read the data and stored it in ORC table. I am interested for all bills that I paid using Debit Card ( transactiontype = "DEB") that comes out the account directly. Transactiontype is the three character code lookup that I download as well. scala> ll_18740868.printSchema root |-- transactiondate: date (nullable = true) |-- transactiontype: string (nullable = true) |-- sortcode: string (nullable = true) |-- accountnumber: string (nullable = true) |-- transactiondescription: string (nullable = true) |-- debitamount: double (nullable = true) |-- creditamount: double (nullable = true) |-- balance: double (nullable = true) The important fields are transactiondate, transactiontype, transactiondescription and debitamount So using analytics. windowing I can do all sorts of things. For example this one gives me the last time I spent money on retailer XYZ and the amount SELECT * FROM ( select transactiondate, transactiondescription, debitamount , rank() over (order by transactiondate desc) AS rank from accounts.ll_18740868 where transactiondescription like '%XYZ%' ) tmp where rank <= 1 And its equivalent using Windowing in FP import org.apache.spark.sql.expressions.Window val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate")) ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription", rank().over(wSpec).as("rank")).filter($"rank"===1).show +---------------+----------------------+----+ |transactiondate|transactiondescription|rank| +---------------+----------------------+----+ | 2015-12-15| XYZ LTD CD 4636 | 1| +---------------+----------------------+----+ So far so good. But if I want to find all I spent on each retailer, then it gets trickier as a retailer appears like below in the column transactiondescription: ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5) +----------------------+ |transactiondescription| +----------------------+ | SAINSBURYS SMKT C...| | SACAT SAINSBURYS ...| | SAINSBURY'S SMKT ...| | SAINSBURYS S/MKT ...| | SACAT SAINSBURYS ...| +----------------------+ If I look at them I know they all belong to SAINBURYS (food retailer). I have done some crude grouping and it works somehow //define UDF here to handle substring val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) } var cutoff = "CD" // currently used in the statement val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription", cutoff)-1)) ll_18740868.where($"transactiontype" === "DEB" && ($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription", cutoff)-1).as("Retailer"),sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println) However, I really need to extract the "keyword" retailer name from transactiondescription column And I need some ideas about the best way of doing it. Is this possible in Spark? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.