Try this: by_clusters = GROUP sample_data by (cluster_id, terms); by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group), COUNT(sample_data) as count;
Cheers, -- Gianmarco On 29 July 2014 11:49, Arian Pasquali <ar...@arianpasquali.com> wrote: > Hi, > > I'm having trouble with a simple task that I believe someone out there must > have already solved some day. > > I'm trying to group and count the frequency of terms for each group in > PigLatin, but I'm having some troubles to figure it out how to do it. > > I have a collection of objects with the following schema: > > {cluster_id: bytearray,terms: chararray} > > And here is some samples > > (10, smerter) > (10, graviditeten) > (10, smerter) > (10, smerter) > (10, udemærket) > (20, eis feuer) > (20, herunterladen schau) > (20, download gratis) > (20, download gratis) > (30, anschauen kinofilm) > (30, kauf rechnung) > (30, kauf rechnung) > (30, versandkostenfreie lieferung) > (30, kostenlose) > (30, kostenlose) > (30, kostenlose) > > the result I m trying to get is something like this > > (10, smerter, 3) > (10, graviditeten, 2) > (10, udemærket, 1) > (20, download gratis, 2) > (20, eis feuer, 1) > (20, herunterladen schau, 1) > (30, kostenlose, 3) > (30, kauf rechnung, 2) > (30, anschauen kinofilm, 1) > (30, versandkostenfreie lieferung, 1) > > What would be the best way to do that? The following code groups by id and > count the terms, but I wanted to count the terms for each group. > > by_clusters = GROUP sample_data by cluster_id; > by_clusters_terms_count = FOREACH by_clusters GENERATE group as > cluster_id, COUNT($1); > > I make the grouping like this I end up with an object with the following > schema > > by_clusters: {group: bytearray,sample_data: {(cluster_id: > bytearray,terms: chararray)}} > > Now, I get to the point to actually count the terms inside the > 'sample_data' tuple. I'm thinking about nested foreach, but I still didn't > get it how could I apply it in this case. The code would be something like > the following: > > result = FOREACH by_clusters { > > --count terms here, I don't know how > > -- compiler gives me an error here > c = GROUP $1 BY terms; -- > d = FOREACH c GENERATE COUNT(b), group; > > GENERATE cluster_id, d; > } > > Error I get: > > ERROR 1200: Syntax error, unexpected symbol at or near '$1 > > Finally, I think I'm close, but I'm unable to solve it. I don't believe > I'll have to write an UDF in this case. > > > Arian >