I would group on the label column, and then just take the distinct values in the id column. You may need to write a UDF, or just do some preprocessing, to turn synonym2_expanded into synonym2, but it sounds like that's what you want to do. I'm not sure how alternative_name fits into this, though?
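Here is a rough, untested sketch of what I mean. It reuses the comma-delimited 'synonyms.txt' from my earlier reply, and it assumes every variant label starts with its base label (synonym1, synonym2, ...), which is only my guess from your sample data. The relation names and the regex are made up for illustration, and it needs a Pig version that ships the built-in REGEX_EXTRACT.

```pig
-- Load (id, label) pairs; the file name and delimiter are assumptions.
a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray, label:chararray);

-- Reduce each label to its base form, e.g. synonym2_expanded -> synonym2.
-- Assumption: variants always start with 'synonym<number>'.
-- Labels that do not match (such as alternative_name) are kept unchanged.
b = FOREACH a GENERATE id,
        (label MATCHES 'synonym[0-9]+.*'
            ? REGEX_EXTRACT(label, '(synonym[0-9]+)', 1)
            : label) AS base;

-- Group on the normalized label.
c = GROUP b BY base;

-- Keep only the distinct ids within each group.
d = FOREACH c {
        ids = DISTINCT b.id;
        GENERATE group, ids;
    };

DUMP d;
```

Note that this only merges ids that share a normalized label. An id that is linked to the rest purely through alternative_name (like 78 in your example) still ends up in its own group, which is the part I'm unsure about.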
2011/7/13 Mike Hugo <[email protected]>

> Thanks so much for the input John! That's not quite what I'm looking for -
> I realize now that my example is not fully complete. There may be different
> sets of synonyms in the input file. For example:
>
> 12 synonym1
> 12 alternative_name
> 45 synonym1 full name and description
> 45 synonym1
> 45 synonym1_expanded
> 78 alternative_name
> 67 synonym1
> 34 synonym2
> 34 synonym2_expanded
> 56 synonym2
> 89 synonym2_expanded
>
> Then the desired output would be:
>
> (arbitrary_id_1, {12, 45, 67, 78})
> (arbitrary_id_2, {34, 56, 89})
>
> (34 has a synonym that matches 56, and 34 has a synonym that matches 89,
> therefore the set of IDs for synonym2 is 34, 56, 89)
>
> The arbitrary ID could be a row label, but it doesn't really matter, what
> I'm really interested in is the bag of ids.
>
> Mike
>
> On Wed, Jul 13, 2011 at 10:13 AM, John Conwell <[email protected]> wrote:
>
> > If I understand you correctly, what you want in the end is a bag with all
> > distinct ids from the original dataset, regardless of the row label. The
> > following will get you that (if that's what you're looking for). Note that
> > in the LOAD statement, I specified a comma as the delimiter.
> >
> > a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray, label:chararray);
> >
> > b = FOREACH a GENERATE id;
> >
> > c = GROUP b BY id;
> >
> > d = FOREACH c GENERATE group;
> >
> > e = GROUP d ALL;
> >
> > dump e;
> >
> > (all,{(12),(45),(67),(78)})
> >
> > On Tue, Jul 12, 2011 at 12:45 PM, Mike Hugo <[email protected]> wrote:
> >
> > > I'm trying to join together several different sources of synonyms using
> > > Pig.
> > >
> > > For example:
> > >
> > > A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray, label:chararray);
> > > DUMP A;
> > > (12,synonym1)
> > > (12,alternative_name)
> > > (45,synonym1 full name and description)
> > > (45,synonym1)
> > > (45,synonym1_expanded)
> > > (78,synonym1)
> > > (67,synonym1)
> > >
> > > I've managed to group things together by the label...
> > >
> > > C = GROUP A BY label;
> > > DUMP C;
> > > (synonym1,{(12,synonym1),(45,synonym1),(78,synonym1),(67,synonym1)})
> > > (alternative_name,{(12,alternative_name)})
> > > (synonym1_expanded,{(45,synonym1_expanded)})
> > > (synonym1 full name and description,{(45,synonym1 full name and description)})
> > >
> > > And then flatten them out a little bit:
> > >
> > > D = FOREACH C GENERATE $0, $1.id;
> > > DUMP D;
> > > (synonym1,{(12),(45),(67)})
> > > (alternative_name,{(12),(78)})
> > > (synonym1_expanded,{(45)})
> > > (synonym1 full name and description,{(45)})
> > >
> > > If you look closely at the data, it turns out that this example test data
> > > set is really all the same - the synonyms all overlap. The final output
> > > I'd like to get to is something like this (the arbitrary_id could be
> > > anything, I really just need a set of the overlapping IDs):
> > >
> > > (arbitrary_id, {12, 45, 67, 78})
> > >
> > > How can I join on the bag of IDs in 'D' to find other labels that have at
> > > least one of the same IDs? Or am I approaching this the wrong way?
> > >
> > > Thanks,
> > >
> > > Mike
> >
> > --
> >
> > Thanks,
> > John C
