Thanks Salih. :)
The output of the groupBy is as below.

2015-01-14 "SEC Inquiry"
2015-01-16 "Re: SEC Inquiry"
2015-01-18 "Fwd: Re: SEC Inquiry"

Subsequently, we would like to aggregate all messages that share a particular reference subject. For instance, the question we are trying to answer could be: get the count of messages with a particular subject.

Looking forward to any suggestions from you.

On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soz...@yahoo.com> wrote:

> Hi Suraj
> What will be your output after group by? Since GroupBy is for aggregations
> like sum, count etc.
> If you want to count the 2015 records then it is possible.
>
> Kind Regards
> Salih Oztop
>
> ------------------------------
> *From:* Suraj Shetiya <surajshet...@gmail.com>
> *To:* user@spark.apache.org
> *Sent:* Tuesday, June 30, 2015 3:05 PM
> *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
>
> I have a dataset (trimmed and simplified) with 2 columns as below.
>
> Date        Subject
> 2015-01-14  "SEC Inquiry"
> 2014-02-12  "Happy birthday"
> 2014-02-13  "Re: Happy birthday"
> 2015-01-16  "Re: SEC Inquiry"
> 2015-01-18  "Fwd: Re: SEC Inquiry"
>
> I have imported the same into a Spark DataFrame. What I am looking for is a
> groupBy on the subject field (however, I need a partial match to identify
> the discussion topic).
>
> For example, in the above case I would like to group all messages whose
> subject contains "SEC Inquiry", which returns the following grouped frame:
>
> 2015-01-14 "SEC Inquiry"
> 2015-01-16 "Re: SEC Inquiry"
> 2015-01-18 "Fwd: Re: SEC Inquiry"
>
> Another use case for a similar problem could be grouping by year (in the
> above example). That would mean a partial match on the date field, i.e.
> groupBy Date by matching the year as "2014" or "2015".
>
> Keenly looking forward to a reply/solution to the above.
>
> - Suraj
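
One possible way to think about the partial-match grouping described above is to derive a grouping key first and then count per key. The sketch below is plain Python, not Spark code, and only illustrates the key derivation on the sample rows from this thread. The helper name thread_key and the set of prefixes it strips ("Re:", "Fwd:", "Fw:") are assumptions made for illustration. In a Spark DataFrame the same key could presumably be computed as a derived column (for example through a UDF) and then passed to groupBy(...).count(); that part is not shown here.

```python
import re
from collections import Counter

# Hypothetical helper (not part of Spark): strips any run of leading
# reply/forward markers so that replies share a key with the original subject.
# Assumption: only "Re:", "Fwd:" and "Fw:" prefixes occur, in any letter case.
_PREFIX = re.compile(r'^(?:\s*(?:re|fwd?)\s*:\s*)+', re.IGNORECASE)


def thread_key(subject):
    """Return the subject with leading Re:/Fwd: markers removed."""
    return _PREFIX.sub('', subject).strip()


# Sample rows taken from the dataset quoted in this thread.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Count messages per discussion topic (normalized subject).
by_thread = Counter(thread_key(subject) for _, subject in rows)

# Count messages per year: the first four characters of a yyyy-MM-dd string.
by_year = Counter(date[:4] for date, _ in rows)

print(by_thread)
print(by_year)
```

Deriving an explicit key, rather than matching on "contains", keeps each message in exactly one group, which is what a count per subject needs.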