If I understand you correctly, what you want in the end is a bag with all
distinct ids from the original dataset, regardless of the row label.  The
following will get you that (if that's what you're looking for).  Note that in
the LOAD statement, I specified a comma as the delimiter.

a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
label:chararray);

b = FOREACH a GENERATE id;

c = GROUP b BY id;

d = FOREACH c GENERATE group;

e = GROUP d ALL;

DUMP e;

(all,{(12),(45),(67),(78)})
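
If it helps, DISTINCT should do the same job as the GROUP BY id / FOREACH
pair above in a single step.  A minimal sketch, assuming the same
comma-delimited synonyms.txt (the alias names ids, uniq and all_ids are just
placeholders I made up):

-- load the (id, label) rows, splitting on commas
a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
label:chararray);

-- keep only the id column
ids = FOREACH a GENERATE id;

-- drop duplicate ids
uniq = DISTINCT ids;

-- collect every remaining id into a single bag
all_ids = GROUP uniq ALL;

DUMP all_ids;

That should give one tuple holding the same set of ids as above, though the
order inside the bag isn't guaranteed.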




On Tue, Jul 12, 2011 at 12:45 PM, Mike Hugo <[email protected]> wrote:

> I'm trying to join together several different sources of synonyms using
> Pig.  For example:
>
> A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray,
> label:chararray);
> DUMP A;
> (12,synonym1)
> (12,alternative_name)
> (45,synonym1 full name and description)
> (45,synonym1)
> (45,synonym1_expanded)
> (78,synonym1)
> (67,synonym1)
>
> I've managed to group things together by the label...
>
> C = GROUP A BY label;
> DUMP C;
> (synonym1,{(12,synonym1),(45,synonym1),(78,synonym1),(67,synonym1)})
> (alternative_name,{(12,alternative_name)})
> (synonym1_expanded,{(45,synonym1_expanded)})
> (synonym1 full name and description,{(45,synonym1 full name and
> description)})
>
> And then flatten them out a little bit:
>
> D = FOREACH C GENERATE $0, $1.id;
> DUMP D;
> (synonym1,{(12),(45),(67)})
> (alternative_name,{(12),(78)})
> (synonym1_expanded,{(45)})
> (synonym1 full name and description,{(45)})
>
>
> If you look closely at the data, it turns out that this example test data
> set is really all the same - the synonyms all overlap.  The final output
> I'd like to get to is something like this (the arbitrary_id could be
> anything, I really just need a set of the overlapping IDs):
>
> (arbitrary_id, {12, 45, 67, 78})
>
> How can I join on the bag of IDs in 'D' to find other labels that have at
> least one of the same IDs?  Or am I approaching this the wrong way?
>
> Thanks,
>
> Mike
>



-- 

Thanks,
John C
