Thanks Salih. :)

The output of the groupby is as below.

2015-01-14      "SEC Inquiry"
2015-01-16      "Re: SEC Inquiry"
2015-01-18      "Fwd: Re: SEC Inquiry"


Subsequently, we would like to aggregate all messages that share a particular
reference subject.
For instance, one question we are trying to answer is: get the count of
messages with a particular subject.
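
To make the intent concrete, here is a rough sketch in plain Python (not Spark)
of the grouping we have in mind. The helper name thread_subject and the list of
reply prefixes ("Re:", "Fwd:", "Fw:") are assumptions made for illustration;
real mail subjects may need more rules. The same function could presumably be
applied as a UDF to derive a grouping column before calling groupBy.

```python
import re
from collections import Counter

# Assumed set of reply/forward prefixes; matched case-insensitively and
# repeatedly, so "Fwd: Re: X" reduces to "X".
_PREFIX = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def thread_subject(subject):
    """Hypothetical helper: strip leading reply/forward prefixes."""
    return _PREFIX.sub("", subject).strip()


# Sample rows taken from the dataset quoted below.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Count of messages per discussion topic (the "partial match" on Subject).
subject_counts = Counter(thread_subject(subject) for _, subject in rows)

# Count of messages per year (the "partial match" on Date).
year_counts = Counter(date[:4] for date, _ in rows)
```

Grouping on a derived key like this, rather than on a substring test per topic,
means the topics do not have to be known in advance.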

Looking forward to any suggestions from you.

On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soz...@yahoo.com> wrote:

> Hi Suraj
> What will your output be after the group by? GroupBy is meant for
> aggregations like sum, count, etc.
> If you want to count the 2015 records, then that is possible.
>
> Kind Regards
> Salih Oztop
>
>
>   ------------------------------
>  *From:* Suraj Shetiya <surajshet...@gmail.com>
> *To:* user@spark.apache.org
> *Sent:* Tuesday, June 30, 2015 3:05 PM
> *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
>
> I have a dataset (trimmed and simplified) with 2 columns as below.
>
> Date            Subject
> 2015-01-14      "SEC Inquiry"
> 2014-02-12      "Happy birthday"
> 2014-02-13      "Re: Happy birthday"
> 2015-01-16      "Re: SEC Inquiry"
> 2015-01-18      "Fwd: Re: SEC Inquiry"
>
> I have imported this into a Spark DataFrame. What I am looking to do is a
> groupBy on the Subject field (however, I need a partial match to identify the
> discussion topic).
>
> For example, in the above case, I would like to group all messages whose
> subject contains "SEC Inquiry", which would return the following grouped
> frame:
>
> 2015-01-14      "SEC Inquiry"
> 2015-01-16      "Re: SEC Inquiry"
> 2015-01-18      "Fwd: Re: SEC Inquiry"
>
> Another use case for a similar problem could be grouping by year (in the
> above example). That would mean a partial match on the Date field, i.e. a
> groupBy on Date that matches the year as "2014" or "2015".
>
> Looking forward to a reply/solution to the above.
>
> - Suraj
>
>
>
>
>
