Re: Advice on algorithm for joining data in bags

Mike Hugo Wed, 13 Jul 2011 09:11:35 -0700

Great, thanks John!  I think I'm on the right path then.

To answer your final question about the alternative name - basically you can
consider each id as a distinct datasource of synonyms.  I'm trying to join
them all together in a single repository.  Looking at the example again,


12 synonym1
12 alternative_name
45 synonym1 full name and description
45 synonym1
45 synonym1_expanded
78 alternative_name
67 synonym1
34 synonym2
34 synonym2_expanded
56 synonym2
89 synonym2_expanded

12 has two "labels" - synonym1 and alternative_name.  synonym1 is found in
45, 12, and 67 so we now know 45, 12, and 67 are the same thing.
alternative_name is found in 12 and 78, so we now know that 12 and 78 are
the same thing.  12 is found in both the first set (45, 12, and 67) and the
second set (12, 78), so we now know those two sets are the same thing,
resulting in the desired output of (12, 45, 67, 78).  The same logic can be
applied to the next set of data:  synonym2 is found in 34 and 56, so they
are the same thing.  synonym2_expanded is found in 34 and 89, so they are
the same thing.  34 is found in both sets, so the final output for that
chunk of data is (34, 56, 89).
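
The merging described above is essentially finding connected components:
ids are nodes, and a shared label links two ids.  As a rough sketch only
(plain Python rather than Pig, assuming the whole dataset fits in memory;
merge_ids and the variable names are made up for illustration), a
union-find pass over the (id, label) pairs could look like this:

```python
from collections import defaultdict


def merge_ids(rows):
    """Group ids linked, directly or transitively, by a shared label.

    rows is an iterable of (id, label) pairs. Hypothetical helper,
    not part of Pig.
    """
    parent = {}

    def find(x):
        # Return the representative id of x's set, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the sets containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_id_for_label = {}
    for id_, label in rows:
        find(id_)  # register the id even if its label is never shared
        if label in first_id_for_label:
            union(first_id_for_label[label], id_)
        else:
            first_id_for_label[label] = id_

    groups = defaultdict(set)
    for id_ in list(parent):
        groups[find(id_)].add(id_)
    return sorted(sorted(group) for group in groups.values())


rows = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "alternative_name"),
    ("67", "synonym1"),
    ("34", "synonym2"),
    ("34", "synonym2_expanded"),
    ("56", "synonym2"),
    ("89", "synonym2_expanded"),
]

for group in merge_ids(rows):
    print(group)
```

On the example data this should give the two sets (12, 45, 67, 78) and
(34, 56, 89).  For data too large for one machine the same idea would
probably have to be run as repeated join/group passes until the sets stop
changing, which is where a UDF or a driver script would come in.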

Thanks for the help, I'll keep playing around with this and take a look at
building a UDF.

Mike

On Wed, Jul 13, 2011 at 11:01 AM, Jonathan Coveney <[email protected]> wrote:

> I would group on the label column, and then just take the distinct values in
> the id column. You may need to make a UDF or just do some processing to turn
> synonym2_expanded into synonym2, but it sounds like that's what you want to
> do. I guess I'm not sure how alternative_name works into this?
>
> 2011/7/13 Mike Hugo <[email protected]>
>
> > Thanks so much for the input John!  That's not quite what I'm looking for -
> > I realize now that my example is not fully complete.  There may be different
> > sets of synonyms in the input file.  For example:
> >
> > 12 synonym1
> > 12 alternative_name
> > 45 synonym1 full name and description
> > 45 synonym1
> > 45 synonym1_expanded
> > 78 alternative_name
> > 67 synonym1
> > 34 synonym2
> > 34 synonym2_expanded
> > 56 synonym2
> > 89 synonym2_expanded
> >
> > Then the desired output would be:
> >
> > (arbitrary_id_1, {12, 45, 67, 78})
> > (arbitrary_id_2, {34, 56, 89})
> >
> > (34 has a synonym that matches 56, and 34 has a synonym that matches 89,
> > therefore the set of IDs for synonym2 is 34, 56, 89)
> >
> > The arbitrary ID could be a row label, but it doesn't really matter, what
> > I'm really interested in is the bag of ids.
> >
> > Mike
> >
> > On Wed, Jul 13, 2011 at 10:13 AM, John Conwell <[email protected]> wrote:
> >
> > > If I understand you correctly, what you want in the end is a bag with all
> > > distinct ids from the original dataset, regardless of the row label.  The
> > > following will get you that (if that's what you're looking for).  Note that
> > > in the LOAD statement, I specified a comma as the delimiter.
> > >
> > > a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
> > > label:chararray);
> > >
> > > b = FOREACH a GENERATE id;
> > >
> > > c = GROUP b BY id;
> > >
> > > d = FOREACH c GENERATE group;
> > >
> > > e = GROUP d ALL;
> > >
> > > DUMP e;
> > >
> > > (all,{(12),(45),(67),(78)})
> > >
> > >
> > >
> > >
> > > On Tue, Jul 12, 2011 at 12:45 PM, Mike Hugo <[email protected]> wrote:
> > >
> > > > I'm trying to join together several different sources of synonyms using
> > > > Pig.  For example:
> > > >
> > > > A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray,
> > > > label:chararray);
> > > > DUMP A;
> > > > (12,synonym1)
> > > > (12,alternative_name)
> > > > (45,synonym1 full name and description)
> > > > (45,synonym1)
> > > > (45,synonym1_expanded)
> > > > (78,alternative_name)
> > > > (67,synonym1)
> > > >
> > > > I've managed to group things together by the label...
> > > >
> > > > C = GROUP A BY label;
> > > > DUMP C;
> > > > (synonym1,{(12,synonym1),(45,synonym1),(67,synonym1)})
> > > > (alternative_name,{(12,alternative_name),(78,alternative_name)})
> > > > (synonym1_expanded,{(45,synonym1_expanded)})
> > > > (synonym1 full name and description,{(45,synonym1 full name and description)})
> > > >
> > > > And then flatten them out a little bit:
> > > >
> > > > D = FOREACH C GENERATE $0, $1.id;
> > > > DUMP D;
> > > > (synonym1,{(12),(45),(67)})
> > > > (alternative_name,{(12),(78)})
> > > > (synonym1_expanded,{(45)})
> > > > (synonym1 full name and description,{(45)})
> > > >
> > > >
> > > > If you look closely at the data, it turns out that this example test data
> > > > set is really all the same - the synonyms all overlap.  The final output
> > > > I'd like to get to is something like this (the arbitrary_id could be
> > > > anything, I really just need a set of the overlapping IDs):
> > > >
> > > > (arbitrary_id, {12, 45, 67, 78})
> > > >
> > > > How can I join on the bag of IDs in 'D' to find other labels that have at
> > > > least one of the same IDs?  Or am I approaching this the wrong way?
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Thanks,
> > > John C
> > >
> >
>
