Re: Advice on algorithm for joining data in bags

Jonathan Coveney Wed, 13 Jul 2011 09:01:46 -0700

I would group on the label column and then take the distinct values in the
id column. You may need to write a UDF, or do some preprocessing, to turn
synonym2_expanded into synonym2, but it sounds like that's what you want to
do. I'm not sure how alternative_name fits into this, though.
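
One note on the transitive case quoted below (34 links to both 56 and 89):
when ids are chained together through different labels, this looks like a
connected-components problem, which a single GROUP does not solve. Plain Pig
Latin has no looping construct, so the following is a rough Python sketch of
the merging step using union-find, not a Pig script. It assumes the
(id, label) pairs fit in memory, and the names find, union and merge_ids are
made up for illustration.

```python
from collections import defaultdict

# Rows from the second example in this thread, as (id, label) pairs.
ROWS = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "alternative_name"),
    ("67", "synonym1"),
    ("34", "synonym2"),
    ("34", "synonym2_expanded"),
    ("56", "synonym2"),
    ("89", "synonym2_expanded"),
]


def find(parent, x):
    """Return the representative of x's set, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def merge_ids(rows):
    """Group ids linked, directly or transitively, by a shared label."""
    parent = {}
    first_id_for_label = {}
    for id_, label in rows:
        find(parent, id_)  # register the id even if its label is unique
        if label in first_id_for_label:
            union(parent, first_id_for_label[label], id_)
        else:
            first_id_for_label[label] = id_

    groups = defaultdict(set)
    for id_ in list(parent):
        groups[find(parent, id_)].add(id_)
    # Sorted for deterministic output; the position acts as the arbitrary id.
    return sorted(sorted(ids) for ids in groups.values())


if __name__ == "__main__":
    for n, ids in enumerate(merge_ids(ROWS), start=1):
        print(f"arbitrary_id_{n}", ids)
```

On the sample rows this yields two groups, 12/45/67/78 and 34/56/89. The same
logic could live in a UDF or a post-processing step after Pig has done the
loading and de-duplication.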


2011/7/13 Mike Hugo <[email protected]>

> Thanks so much for the input, John! That's not quite what I'm looking for.
> I realize now that my example was incomplete: there may be several
> different sets of synonyms in the input file. For example:
>
> 12 synonym1
> 12 alternative_name
> 45 synonym1 full name and description
> 45 synonym1
> 45 synonym1_expanded
> 78 alternative_name
> 67 synonym1
> 34 synonym2
> 34 synonym2_expanded
> 56 synonym2
> 89 synonym2_expanded
>
> Then the desired output would be:
>
> (arbitrary_id_1, {12, 45, 67, 78})
> (arbitrary_id_2, {34, 56, 89})
>
> (34 has a synonym that matches 56, and 34 has a synonym that matches 89,
> therefore the set of IDs for synonym2 is 34, 56, 89)
>
> The arbitrary ID could be a row label, but it doesn't really matter; what
> I'm really interested in is the bag of IDs.
>
> Mike
>
> On Wed, Jul 13, 2011 at 10:13 AM, John Conwell <[email protected]> wrote:
>
> > If I understand you correctly, what you want in the end is a bag with all
> > distinct ids from the original dataset, regardless of the row label. The
> > following will get you that (if that's what you're looking for). Note that
> > in the LOAD statement I specified a comma as the delimiter.
> >
> > a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
> > label:chararray);
> >
> > b = FOREACH a GENERATE id;
> >
> > c = GROUP b BY id;
> >
> > d = FOREACH c GENERATE group;
> >
> > e = GROUP d ALL;
> >
> > DUMP e;
> >
> > (all,{(12),(45),(67),(78)})
> >
> >
> >
> >
> > On Tue, Jul 12, 2011 at 12:45 PM, Mike Hugo <[email protected]> wrote:
> >
> > > I'm trying to join together several different sources of synonyms using
> > > Pig. For example:
> > >
> > > A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray,
> > > label:chararray);
> > > DUMP A;
> > > (12,synonym1)
> > > (12,alternative_name)
> > > (45,synonym1 full name and description)
> > > (45,synonym1)
> > > (45,synonym1_expanded)
> > > (78,alternative_name)
> > > (67,synonym1)
> > >
> > > I've managed to group things together by the label...
> > >
> > > C = GROUP A BY label;
> > > DUMP C;
> > > (synonym1,{(12,synonym1),(45,synonym1),(67,synonym1)})
> > > (alternative_name,{(12,alternative_name),(78,alternative_name)})
> > > (synonym1_expanded,{(45,synonym1_expanded)})
> > > (synonym1 full name and description,{(45,synonym1 full name and
> > > description)})
> > >
> > > And then flatten them out a little bit:
> > >
> > > D = FOREACH C GENERATE $0, $1.id;
> > > DUMP D;
> > > (synonym1,{(12),(45),(67)})
> > > (alternative_name,{(12),(78)})
> > > (synonym1_expanded,{(45)})
> > > (synonym1 full name and description,{(45)})
> > >
> > >
> > > If you look closely at the data, it turns out that this example test
> > > data set is really all the same - the synonyms all overlap.  The final
> > > output I'd like to get to is something like this (the arbitrary_id
> > > could be anything, I really just need a set of the overlapping IDs):
> > >
> > > (arbitrary_id, {12, 45, 67, 78})
> > >
> > > How can I join on the bag of IDs in 'D' to find other labels that have
> > > at least one of the same IDs?  Or am I approaching this the wrong way?
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> >
> >
> >
> > --
> >
> > Thanks,
> > John C
> >
>
