You should probably write a UDF that uses a regular expression or other string munging to canonicalize the subject, and then group on that derived column.
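As a rough sketch of the canonicalization step (plain Python so it runs without a cluster): the function name `canonical_subject` and the prefix pattern are my own assumptions, not anything from Spark, and the pattern only handles "Re:", "Fwd:" and "Fw:" style prefixes. It also assumes the quotes in your sample are just delimiters and not part of the subject value.

```python
import re
from collections import Counter

# Hypothetical pattern: matches any run of leading "Re:" / "Fwd:" / "Fw:" markers.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

def canonical_subject(subject):
    """Return the subject with reply/forward prefixes and outer whitespace removed."""
    return _PREFIX.sub("", subject).strip()

# Sample rows copied from the original question.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Group on the derived value and count messages per topic.
counts = Counter(canonical_subject(subject) for _, subject in rows)

# In PySpark the same function could be wrapped as a UDF, roughly:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   canon = udf(canonical_subject, StringType())
#   df.withColumn("topic", canon(df["Subject"])).groupBy("topic").count()
```

The same derived-column idea should cover the group-by-year case too: if the Date column is a string in the form shown, a UDF that returns its first four characters would give you a year column to group on.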
On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya <surajshet...@gmail.com> wrote:

> Thanks Salih. :)
>
> The output of the groupby is as below.
>
> 2015-01-14 "SEC Inquiry"
> 2015-01-16 "Re: SEC Inquiry"
> 2015-01-18 "Fwd: Re: SEC Inquiry"
>
> And subsequently, we would like to aggregate all messages with a
> particular reference subject.
> For instance the question we are trying to answer could be : Get the count
> of messages with a particular subject.
>
> Looking forward to any suggestion from you.
>
> On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soz...@yahoo.com> wrote:
>
>> Hi Suraj
>> What will be your output after group by? Since GroupBy is for
>> aggregations like sum, count etc.
>> If you want to count the 2015 records than it is possible.
>>
>> Kind Regards
>> Salih Oztop
>>
>> ------------------------------
>> *From:* Suraj Shetiya <surajshet...@gmail.com>
>> *To:* user@spark.apache.org
>> *Sent:* Tuesday, June 30, 2015 3:05 PM
>> *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
>>
>> I have a dataset (trimmed and simplified) with 2 columns as below.
>>
>> Date        Subject
>> 2015-01-14  "SEC Inquiry"
>> 2014-02-12  "Happy birthday"
>> 2014-02-13  "Re: Happy birthday"
>> 2015-01-16  "Re: SEC Inquiry"
>> 2015-01-18  "Fwd: Re: SEC Inquiry"
>>
>> I have imported the same in a Spark Dataframe. What I am looking at is
>> groupBy subject field (however, I need a partial match to identify the
>> discussion topic).
>>
>> For example in the above case.. I would like to group all messages, which
>> have subject containing "SEC Inquiry" which returns following grouped
>> frame:
>>
>> 2015-01-14 "SEC Inquiry"
>> 2015-01-16 "Re: SEC Inquiry"
>> 2015-01-18 "Fwd: Re: SEC Inquiry"
>>
>> Another usecase for a similar problem could be group by year (in the
>> above example), it would mean partial match of the date field, which would
>> mean groupBy Date by matching year as "2014" or "2015".
>>
>> Keenly Looking forward to reply/solution to the above.
>>
>> - Suraj