Try this:

by_clusters = GROUP sample_data BY (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE
    FLATTEN(group), COUNT(sample_data) AS count;
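
Grouping on the (cluster_id, terms) pair sidesteps the nested FOREACH entirely: GROUP is not one of the operators Pig allows inside a nested block, which is why the attempt quoted below fails to parse. Here is a fuller sketch of the same idea, from load to sorted output. The file name 'sample_data.csv', the comma delimiter, and the aliases term_counts, term_count and sorted_counts are assumptions of mine for illustration, not something from your script:

```pig
-- Hypothetical input: one "cluster_id,terms" pair per line (path and delimiter are assumed).
sample_data = LOAD 'sample_data.csv' USING PigStorage(',')
    AS (cluster_id:bytearray, terms:chararray);

-- Group on the composite key so each distinct (cluster_id, terms) pair gets its own bag.
by_clusters = GROUP sample_data BY (cluster_id, terms);

-- FLATTEN turns the group tuple back into two top-level fields;
-- COUNT counts the rows collected in each bag.
term_counts = FOREACH by_clusters GENERATE
    FLATTEN(group) AS (cluster_id, terms),
    COUNT(sample_data) AS term_count;

-- Optional: most frequent terms first within each cluster.
sorted_counts = ORDER term_counts BY cluster_id ASC, term_count DESC;

DUMP sorted_counts;
```

The ORDER step is only there to match the ordering of your desired result; drop it if the order does not matter.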

Cheers,

--
Gianmarco


On 29 July 2014 11:49, Arian Pasquali <ar...@arianpasquali.com> wrote:

> Hi,
>
> I'm having trouble with a simple task that I believe someone out there must
> have already solved some day.
>
> I'm trying to count the frequency of each term within each group in
> Pig Latin, but I'm having trouble figuring out how to do it.
>
> I have a collection of objects with the following schema:
>
> {cluster_id: bytearray,terms: chararray}
>
> And here are some samples:
>
> (10, smerter)
> (10, graviditeten)
> (10, smerter)
> (10, smerter)
> (10, udemærket)
> (20, eis feuer)
> (20, herunterladen schau)
> (20, download gratis)
> (20, download gratis)
> (30, anschauen kinofilm)
> (30, kauf rechnung)
> (30, kauf rechnung)
> (30, versandkostenfreie lieferung)
> (30, kostenlose)
> (30, kostenlose)
> (30, kostenlose)
>
> The result I'm trying to get is something like this:
>
> (10, smerter, 3)
> (10, graviditeten, 1)
> (10, udemærket, 1)
> (20, download gratis, 2)
> (20, eis feuer, 1)
> (20, herunterladen schau, 1)
> (30, kostenlose, 3)
> (30, kauf rechnung, 2)
> (30, anschauen kinofilm, 1)
> (30, versandkostenfreie lieferung, 1)
>
> What would be the best way to do that? The following code groups by id and
> counts the terms, but I want to count each term within each group.
>
> by_clusters = GROUP sample_data by cluster_id;
> by_clusters_terms_count = FOREACH by_clusters GENERATE group as
> cluster_id, COUNT($1);
>
> If I group like this, I end up with an object with the following
> schema:
>
> by_clusters: {group: bytearray,sample_data: {(cluster_id:
> bytearray,terms: chararray)}}
>
> Now I need to actually count the terms inside the 'sample_data' bag.
> I'm thinking about a nested FOREACH, but I still don't see how to
> apply it in this case. The code would be something like the
> following:
>
> result = FOREACH by_clusters {
>
> --count terms here, I don't know how
>
> -- compiler gives me an error here
> c = GROUP $1 BY terms; --
> d = FOREACH c GENERATE COUNT(b), group;
>
> GENERATE cluster_id, d;
> }
>
> Error I get:
>
> ERROR 1200: Syntax error, unexpected symbol at or near '$1
>
> I think I'm close, but I'm unable to solve it. I don't believe
> I should have to write a UDF in this case.
>
>
> Arian
>
