Re: Advice on algorithm for joining data in bags

Mike Hugo Wed, 13 Jul 2011 08:35:44 -0700

Thanks so much for the input, John!  That's not quite what I'm looking for -
I realize now that my example was incomplete.  There may be several different
sets of synonyms in the input file.  For example:


12 synonym1
12 alternative_name
45 synonym1 full name and description
45 synonym1
45 synonym1_expanded
78 alternative_name
67 synonym1
34 synonym2
34 synonym2_expanded
56 synonym2
89 synonym2_expanded

Then the desired output would be:

(arbitrary_id_1, {12, 45, 67, 78})
(arbitrary_id_2, {34, 56, 89})

(34 has a synonym that matches 56, and 34 has a synonym that matches 89,
therefore the set of IDs for synonym2 is 34, 56, 89)

The arbitrary ID could be a row label, but it doesn't really matter; what
I'm really interested in is the bag of IDs.
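
One way to read the desired output above is as a connected-components
problem: two IDs belong to the same bag if they share a label, directly or
through a chain of other IDs.  The following is a minimal sketch of that idea
in plain Python (not Pig), using a union-find structure.  The function name
group_ids and the in-memory rows list are assumptions made for illustration;
they are not part of any Pig API, and this is not a drop-in Pig solution.

```python
from collections import defaultdict

# Example rows from the message above, as (id, label) pairs.
rows = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "alternative_name"),
    ("67", "synonym1"),
    ("34", "synonym2"),
    ("34", "synonym2_expanded"),
    ("56", "synonym2"),
    ("89", "synonym2_expanded"),
]


def group_ids(rows):
    """Hypothetical helper: merge ids that share a label, transitively."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first id seen for each label; every later id carrying
    # the same label is merged into that id's set.
    first_id_for_label = {}
    for id_, label in rows:
        find(id_)
        if label in first_id_for_label:
            union(first_id_for_label[label], id_)
        else:
            first_id_for_label[label] = id_

    # Collect ids by their root to form the final bags.
    groups = defaultdict(set)
    for id_ in parent:
        groups[find(id_)].add(id_)
    return sorted(sorted(group) for group in groups.values())


result = group_ids(rows)
for n, ids in enumerate(result, 1):
    print(f"arbitrary_id_{n}", ids)
```

On the example rows this yields two bags, one with 12, 45, 67, 78 and one
with 34, 56, 89.  Doing the same at scale in Pig would likely need an
iterative job, since the merging is transitive.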

Mike

On Wed, Jul 13, 2011 at 10:13 AM, John Conwell <[email protected]> wrote:

> If I understand you correctly, what you want in the end is a bag with all
> distinct ids from the original dataset, regardless of the row label.  The
> following will get you that (if that's what you're looking for).  Note that
> in the LOAD statement, I specified a comma as the delimiter.
>
> a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
> label:chararray);
>
> b = FOREACH a GENERATE id;
>
> c = GROUP b BY id;
>
> d = FOREACH c GENERATE group;
>
> e = GROUP d ALL;
>
> dump e;
>
> (all,{(12),(45),(67),(78)})
>
>
>
>
> On Tue, Jul 12, 2011 at 12:45 PM, Mike Hugo <[email protected]> wrote:
>
> > I'm trying to join together several different sources of synonyms using
> > Pig.  For example:
> >
> > A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray,
> > label:chararray);
> > DUMP A;
> > (12,synonym1)
> > (12,alternative_name)
> > (45,synonym1 full name and description)
> > (45,synonym1)
> > (45,synonym1_expanded)
> > (78,synonym1)
> > (67,synonym1)
> >
> > I've managed to group things together by the label...
> >
> > C = GROUP A BY label;
> > DUMP C;
> > (synonym1,{(12,synonym1),(45,synonym1),(78,synonym1),(67,synonym1)})
> > (alternative_name,{(12,alternative_name)})
> > (synonym1_expanded,{(45,synonym1_expanded)})
> > (synonym1 full name and description,{(45,synonym1 full name and
> > description)})
> >
> > And then flatten them out a little bit:
> >
> > D = FOREACH C GENERATE $0, $1.id;
> > DUMP D;
> > (synonym1,{(12),(45),(78),(67)})
> > (alternative_name,{(12)})
> > (synonym1_expanded,{(45)})
> > (synonym1 full name and description,{(45)})
> >
> >
> > If you look closely at the data, it turns out that this example test data
> > set is really all the same - the synonyms all overlap.  The final output
> > I'd like to get to is something like this (the arbitrary_id could be
> > anything, I really just need a set of the overlapping IDs):
> >
> > (arbitrary_id, {12, 45, 67, 78})
> >
> > How can I join on the bag of IDs in 'D' to find other labels that have at
> > least one of the same IDs?  Or am I approaching this the wrong way?
> >
> > Thanks,
> >
> > Mike
> >
>
>
>
> --
>
> Thanks,
> John C
>
