Hi Salih,

Thanks for the links :) This seems very promising to me.
When do you think this would be available in the Spark codeline?

Thanks,
Suraj

On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop <soz...@yahoo.com> wrote:

> Hi Suraj,
> It seems your requirement is Record Linkage/Entity Resolution.
> https://en.wikipedia.org/wiki/Record_linkage
> http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
>
> A presentation from Spark Summit using GraphX:
> https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark
>
> Kind Regards
> Salih Oztop
> 07856128843
> http://www.linkedin.com/in/salihoztop
>
> ------------------------------
> *From:* Suraj Shetiya <surajshet...@gmail.com>
> *To:* Michael Armbrust <mich...@databricks.com>
> *Cc:* Salih Oztop <soz...@yahoo.com>; "user@spark.apache.org" <user@spark.apache.org>; megha.sridh...@cynepia.com
> *Sent:* Thursday, July 2, 2015 10:47 AM
> *Subject:* Re: Spark Dataframe 1.4 (GroupBy partial match)
>
> Hi Michael,
>
> Thanks for the quick response. This sounds like something that would work.
> However, rethinking the problem statement and the growing set of related
> use cases, where columns can embed structured and unstructured data (JSON,
> XML, or other collections), it may make sense to support probabilistic
> groupBy operations, so that the user gets the same functionality in one
> step instead of two.
>
> Your thoughts on whether that makes sense?
>
> -Suraj
>
> ---------- Forwarded message ----------
> From: "Michael Armbrust" <mich...@databricks.com>
> Date: Jul 2, 2015 12:49 AM
> Subject: Re: Spark Dataframe 1.4 (GroupBy partial match)
> To: "Suraj Shetiya" <surajshet...@gmail.com>
> Cc: "Salih Oztop" <soz...@yahoo.com>, "user@spark.apache.org" <user@spark.apache.org>
>
> You should probably write a UDF that uses regular expressions or other
> string munging to canonicalize the subject, and then group on that derived
> column.
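[Editor's sketch] Michael's suggestion, a UDF that canonicalizes the subject before grouping, could look like the following. The regex logic is plain Python; the PySpark registration lines are shown only as an indicative comment (assuming the Spark 1.4-era `udf` API) and are not part of the thread:

```python
import re

# Strip any leading chain of "Re:" / "Fw:" / "Fwd:" prefixes (case-insensitive)
# so that "Fwd: Re: SEC Inquiry" and "SEC Inquiry" land in the same group.
PREFIX = re.compile(r'^(?:\s*(?:re|fwd?)\s*:\s*)+', re.IGNORECASE)

def canonicalize_subject(subject):
    """Return the subject with all reply/forward prefixes removed."""
    return PREFIX.sub('', subject).strip()

# Sketch of how this would plug into a DataFrame (hypothetical column names,
# not executed here):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   canon = udf(canonicalize_subject, StringType())
#   df.withColumn("topic", canon(df["Subject"])).groupBy("topic").count()
```

This keeps the two-step shape Michael describes: derive a canonical column first, then use the ordinary `groupBy` on it.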
On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya <surajshet...@gmail.com> wrote:

> Thanks Salih. :)
>
> The output of the groupBy is as below:
>
> 2015-01-14  "SEC Inquiry"
> 2015-01-16  "Re: SEC Inquiry"
> 2015-01-18  "Fwd: Re: SEC Inquiry"
>
> Subsequently, we would like to aggregate all messages that share a
> reference subject. For instance, the question we are trying to answer
> could be: get the count of messages with a particular subject.
>
> Looking forward to any suggestions from you.
>
> On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soz...@yahoo.com> wrote:
>
> > Hi Suraj,
> > What will be your output after the group by? GroupBy is for
> > aggregations like sum, count, etc. If you want to count the 2015
> > records, then it is possible.
> >
> > Kind Regards
> > Salih Oztop
> >
> > ------------------------------
> > *From:* Suraj Shetiya <surajshet...@gmail.com>
> > *To:* user@spark.apache.org
> > *Sent:* Tuesday, June 30, 2015 3:05 PM
> > *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
> >
> > I have a dataset (trimmed and simplified) with two columns as below:
> >
> > Date        Subject
> > 2015-01-14  "SEC Inquiry"
> > 2014-02-12  "Happy birthday"
> > 2014-02-13  "Re: Happy birthday"
> > 2015-01-16  "Re: SEC Inquiry"
> > 2015-01-18  "Fwd: Re: SEC Inquiry"
> >
> > I have imported the same into a Spark DataFrame. What I am looking at
> > is a groupBy on the Subject field; however, I need a partial match to
> > identify the discussion topic.
> >
> > For example, in the above case I would like to group all messages whose
> > subject contains "SEC Inquiry", which returns the following grouped
> > frame:
> >
> > 2015-01-14  "SEC Inquiry"
> > 2015-01-16  "Re: SEC Inquiry"
> > 2015-01-18  "Fwd: Re: SEC Inquiry"
> >
> > Another use case for a similar problem could be grouping by year (in
> > the above example), which would mean a partial match on the Date field:
> > groupBy on Date matching the year "2014" or "2015".
> >
> > Keenly looking forward to a reply/solution to the above.
> >
> > - Suraj

--
Regards,
Suraj
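[Editor's sketch] Both "partial match" groupings asked about above reduce to grouping on a derived key: the canonical subject for topics, and the first four characters of the date for years. The following plain-Python illustration runs the two-step approach on the sample rows from the thread (a local stand-in for the DataFrame version, not Spark code):

```python
import re
from collections import Counter

# The sample rows from the original message: (Date, Subject).
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Strip leading "Re:" / "Fw:" / "Fwd:" chains to recover the discussion topic.
prefix = re.compile(r'^(?:\s*(?:re|fwd?)\s*:\s*)+', re.IGNORECASE)

# Group by canonical subject: message count per discussion topic.
by_topic = Counter(prefix.sub('', subj).strip() for _, subj in rows)
# by_topic["SEC Inquiry"] == 3, by_topic["Happy birthday"] == 2

# Group by year: the "partial match" on Date is just its first four characters.
by_year = Counter(date[:4] for date, _ in rows)
# by_year["2015"] == 3, by_year["2014"] == 2
```

In Spark the same derived keys would be produced with a UDF (or, for the year, a substring expression) and fed to `groupBy().count()`.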