I agree with you.
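
For what it's worth, a minimal sketch of the insert-only pattern you describe below, i.e. append every correction with a load timestamp and pick the latest row per key with a window at read time (table and column names here are made up for illustration, untested):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes a Spark 2.x session named spark (e.g. spark-shell)

// Append-only catalog: every correction is a new row stamped with load_ts.
val catalogRaw = spark.table("accounts.catalog_appendonly")   // hypothetical append-only table

// At read time keep only the newest row per transactiondescription.
val latestPerKey = Window.partitionBy($"transactiondescription").orderBy(desc("load_ts"))

val catalogLatest = catalogRaw
  .withColumn("rn", row_number().over(latestPerKey))
  .filter($"rn" === 1)
  .drop("rn")

No updates or compaction needed; the price is a window over the (small) catalog at read time.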
> On 03 Aug 2016, at 01:20, ayan guha <guha.a...@gmail.com> wrote:
>
> I would stay away from transaction tables until they are fully baked. I do not see why you need to update vs keep inserting with a timestamp and, while joining, derive the latest value on the fly.
>
> But I guess it has become a religious question now :) and I am not unbiased.
>
>> On 3 Aug 2016 08:51, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>
>> There are many ways of addressing this issue.
>>
>> Using HBase with Phoenix adds another layer to the stack which is not necessary for a handful of tables and will add to cost (someone else has to know about HBase, Phoenix etc.). BTW, I would rather work directly on the HBase table; it is faster.
>>
>> There may be, say, 100 new entries into this catalog table, with multiple updates (not a single DML statement) to get the hashtag right. Sometimes it is an iterative process which results in many deltas.
>>
>> If that needs to be done once a day or on demand, an alternative would be to INSERT OVERWRITE the transactional Hive table, deltas and all, into a text table in Hive and present that one to Spark. This allows Spark to see the data.
>>
>> Remember, if I use Hive to do the analytics/windowing, there is no issue. The issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
>>
>> Sounds like an issue for Spark to resolve later.
>>
>> Another alternative: one can leave the transactional table in an RDBMS for this purpose and load it into a DF through the JDBC interface. It works fine and is pretty fast.
>>
>> Again, these are all workarounds. I discussed this in the Hive forum. There should be a way to "manually compact a transactional table in Hive" (not possible now), and second, if Hive can see the data in the Hive table, why not Spark?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>
>>> On 2 August 2016 at 23:10, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> +1
>>>
>>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> If you need to use single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive / warehouse offering.
>>>>
>>>>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I decided to create a catalog table in Hive, ORC and transactional. That table has two columns of value:
>>>>>
>>>>> transactiondescription === account_table.transactiondescription
>>>>> hashtag: a String column created from a semi-automated process of deriving it from account_table.transactiondescription
>>>>>
>>>>> Once the process of populating the catalog table is complete, we just need to create a new DF based on a join between the catalog table and the account_table. The join will use the hashtag in the catalog table to loop over the debit column in account_table for a given hashtag. That is pretty fast, as going through pattern matching in real time is pretty intensive in any application and database.
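
For reference, a rough sketch of the join described above, assuming the catalog table is visible to Spark as accounts.catalog with columns (transactiondescription, hashtag) and the account table as accounts.ll_18740868 (names as used elsewhere in the quoted thread; the code is illustrative and untested):

import org.apache.spark.sql.functions._
import spark.implicits._   // assumes a Spark 2.x session named spark (e.g. spark-shell)

// The catalog maps each raw transactiondescription to a hashtag.
val catalog_df  = spark.table("accounts.catalog")        // hypothetical Hive table name
val ll_18740868 = spark.table("accounts.ll_18740868")

// Join each debit row to its hashtag on the exact description, then total spend per hashtag.
val spentPerHashtag = ll_18740868
  .filter($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .join(catalog_df, Seq("transactiondescription"))
  .groupBy($"hashtag")
  .agg(sum($"debitamount").as("Spent"))
  .orderBy($"Spent".desc)

spentPerHashtag.show(false)

Because the catalog is small, caching it (or letting Spark broadcast it) keeps the join cheap.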
>>>>> So one can build up the catalog table over time as a reference table. I am sure such tables exist in the commercial world.
>>>>>
>>>>> Anyway, after getting the results out I know how I am wasting my money on different things, especially on clothing etc. :)
>>>>>
>>>>> HTH
>>>>>
>>>>> P.S. Also, there is an issue with Spark not being able to read data through Hive transactional tables that have not been compacted yet. Spark just crashes. If these tables need to be updated regularly, say the catalog table, and they are pretty small, one might maintain them in an RDBMS and read them once through JDBC into a DataFrame in Spark before doing the analytics.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>
>>>>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case - and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc. you can strip off to get the base retailer entity?
>>>>>>
>>>>>> Best Regards,
>>>>>> Sonal
>>>>>> Founder, Nube Technologies
>>>>>> Reifier at Strata Hadoop World
>>>>>> Reifier at Spark Summit 2015
>>>>>>
>>>>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> I believe there is some catalog of companies that I can get, store in a table and use to match the company name to the transactiondescription column.
>>>>>>>
>>>>>>> That catalog should have sectors in it. For example, company XYZ is under Grocers etc., which will make search and grouping much easier.
>>>>>>>
>>>>>>> I believe Spark can do it, though I am generally interested in alternative ideas.
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> LinkedIn
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
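
On the P.S. above about keeping a small, frequently updated catalog table in an RDBMS: a minimal sketch of reading it once through JDBC into a DataFrame (connection URL, table name and credentials are placeholders):

// Read the small catalog table from the RDBMS once, then reuse it for the joins.
val catalogFromRdbms = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/accounts")   // placeholder connection URL
  .option("dbtable", "public.catalog")                       // placeholder schema.table
  .option("user", "scratchpad")
  .option("password", "xxxx")
  .load()

catalogFromRdbms.cache()   // it is small, so keep it in memory for repeated use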
>>>>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>>>
>>>>>>>> Well, if you still want to use window functions for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic.
>>>>>>>>
>>>>>>>> Maybe you can use a regex for deriving this new column. The implementation needs to depend on your data in "transactiondescription", and regex gives you the most powerful way to handle your data.
>>>>>>>>
>>>>>>>> This is really not a Spark question, but a question of how you process your logic based on the data given.
>>>>>>>>
>>>>>>>> Yong
>>>>>>>>
>>>>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>>>>> To: user @spark
>>>>>>>> Subject: Extracting key word from a textual column
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Need some ideas.
>>>>>>>>
>>>>>>>> Summary:
>>>>>>>>
>>>>>>>> I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) on a given retailer, so I have a better idea of where I am wasting the money.
>>>>>>>>
>>>>>>>> Approach:
>>>>>>>>
>>>>>>>> Downloaded my bank statements for a given account in CSV format from inception till the end of July. Read the data and stored it in an ORC table.
>>>>>>>>
>>>>>>>> I am interested in all bills that I paid using the debit card (transactiontype = "DEB") that come out of the account directly. Transactiontype is the three-character lookup code that I downloaded as well.
>>>>>>>>
>>>>>>>> scala> ll_18740868.printSchema
>>>>>>>> root
>>>>>>>>  |-- transactiondate: date (nullable = true)
>>>>>>>>  |-- transactiontype: string (nullable = true)
>>>>>>>>  |-- sortcode: string (nullable = true)
>>>>>>>>  |-- accountnumber: string (nullable = true)
>>>>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>>>>  |-- debitamount: double (nullable = true)
>>>>>>>>  |-- creditamount: double (nullable = true)
>>>>>>>>  |-- balance: double (nullable = true)
>>>>>>>>
>>>>>>>> The important fields are transactiondate, transactiontype, transactiondescription and debitamount.
>>>>>>>>
>>>>>>>> So using analytics/windowing I can do all sorts of things. For example, this one gives me the last time I spent money on retailer XYZ and the amount:
>>>>>>>>
>>>>>>>> SELECT *
>>>>>>>> FROM (
>>>>>>>>   select transactiondate, transactiondescription, debitamount
>>>>>>>>        , rank() over (order by transactiondate desc) AS rank
>>>>>>>>   from accounts.ll_18740868
>>>>>>>>   where transactiondescription like '%XYZ%'
>>>>>>>> ) tmp
>>>>>>>> where rank <= 1
>>>>>>>>
>>>>>>>> And its equivalent using windowing in FP:
>>>>>>>>
>>>>>>>> import org.apache.spark.sql.expressions.Window
>>>>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).
>>>>>>>>   select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).
>>>>>>>>   filter($"rank" === 1).show
>>>>>>>>
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> |transactiondate|transactiondescription|rank|
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>>>>> +---------------+----------------------+----+
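
Picking up Yong's regex suggestion from the top of his message: a rough sketch of deriving a retailer column with regexp_extract and partitioning the same rank-over-date window by it, so the "last purchase" query above runs for every retailer in one pass (the regex is illustrative only and would need tuning against the real descriptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Crude retailer key: everything before a trailing " CD nnnn" card marker, if present.
val withRetailer = ll_18740868
  .filter($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("retailer",
    trim(regexp_extract($"transactiondescription", "^(.*?)(?:\\s+CD\\b.*)?$", 1)))

// Most recent purchase per derived retailer.
val wByRetailer = Window.partitionBy($"retailer").orderBy(desc("transactiondate"))
withRetailer
  .select($"retailer", $"transactiondate", $"debitamount", rank().over(wByRetailer).as("rank"))
  .filter($"rank" === 1)
  .show(false)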
>>>>>>>> So far so good. But if I want to find all I spent on each retailer, then it gets trickier, as a retailer appears like below in the column transactiondescription:
>>>>>>>>
>>>>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>>>>> +----------------------+
>>>>>>>> |transactiondescription|
>>>>>>>> +----------------------+
>>>>>>>> |  SAINSBURYS SMKT C...|
>>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>>> |  SAINSBURY'S SMKT ...|
>>>>>>>> |  SAINSBURYS S/MKT ...|
>>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>>> +----------------------+
>>>>>>>>
>>>>>>>> If I look at them I know they all belong to SAINSBURYS (the food retailer). I have done some crude grouping and it works, somehow:
>>>>>>>>
>>>>>>>> // define UDF here to handle substring
>>>>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>>>>> var cutoff = "CD"  // currently used in the statement
>>>>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0),
>>>>>>>>   instr($"transactiondescription", cutoff) - 1))
>>>>>>>> ll_18740868.where($"transactiontype" === "DEB" && ($"transactiondescription").isNotNull).
>>>>>>>>   select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"),
>>>>>>>>     sum($"debitamount").over(wSpec2).as("Spent")).
>>>>>>>>   distinct.orderBy($"Spent").collect.foreach(println)
>>>>>>>>
>>>>>>>> However, I really need to extract the "keyword" retailer name from the transactiondescription column, and I need some ideas about the best way of doing it. Is this possible in Spark?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>> LinkedIn
>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>
>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>
>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
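
Coming back to the question at the end of the thread ("I really need to extract the keyword retailer name... Is this possible in Spark?"): one way to combine Sonal's stop-word idea with the catalog/hashtag table is to normalise the description first and then map it to a canonical retailer. A rough, untested sketch (the stop-word list and the catalog rows are purely illustrative):

import org.apache.spark.sql.functions._
import spark.implicits._

// 1) Normalise: upper-case and strip store-format noise such as SMKT, S/MKT, SACAT, LTD, "CD nnnn".
val noise = "\\b(SMKT|S/MKT|SACAT|LTD|CD\\s+\\d+)\\b"
val normalised = ll_18740868
  .filter($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("clean_desc", trim(regexp_replace(upper($"transactiondescription"), noise, " ")))
  .withColumn("clean_desc", regexp_replace($"clean_desc", "\\s+", " "))

// 2) Map the cleaned description to a canonical retailer through a small reference table
//    (hashtag, pattern); in practice this would be the Hive/RDBMS catalog table discussed above.
val catalog = Seq(("SAINSBURYS", "SAINSBURY"), ("TESCO", "TESCO")).toDF("hashtag", "pattern")

val spentPerRetailer = normalised
  .join(catalog, normalised("clean_desc").contains(catalog("pattern")), "left")
  .withColumn("retailer", coalesce($"hashtag", $"clean_desc"))   // fall back to the cleaned text
  .groupBy($"retailer")
  .agg(sum($"debitamount").as("Spent"))
  .orderBy($"Spent".desc)

spentPerRetailer.show(false)

If a description matches more than one catalog pattern the join will duplicate that row, so the patterns should be kept mutually exclusive (or the match resolved afterwards with a window).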