Advice on algorithm for joining data in bags

Mike Hugo Tue, 12 Jul 2011 12:46:34 -0700

I'm trying to join together several different sources of synonyms using Pig.
For example:


A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray,
label:chararray);
DUMP A;
(12,synonym1)
(12,alternative_name)
(45,synonym1 full name and description)
(45,synonym1)
(45,synonym1_expanded)
(78,synonym1)
(67,synonym1)

I've managed to group things together by the label...

C = GROUP A BY label;
DUMP C;
(synonym1,{(12,synonym1),(45,synonym1),(78,synonym1),(67,synonym1)})
(alternative_name,{(12,alternative_name)})
(synonym1_expanded,{(45,synonym1_expanded)})
(synonym1 full name and description,{(45,synonym1 full name and description)})

And then project just the IDs out of each bag:

D = FOREACH C GENERATE $0, $1.id;
DUMP D;
(synonym1,{(12),(45),(78),(67)})
(alternative_name,{(12)})
(synonym1_expanded,{(45)})
(synonym1 full name and description,{(45)})


If you look closely at the data, every ID in this example test data set
really refers to the same thing: the synonyms all overlap, either directly
or through a shared ID.  The final output I'd like to get to is something
like this (the arbitrary_id could be anything, I really just need a set of
the overlapping IDs):

(arbitrary_id, {12, 45, 67, 78})
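
To pin down what I mean by "overlapping", here is a minimal sketch of the
result I'm after, written in plain Python rather than Pig. It treats two IDs
as belonging together when they share a label, directly or through a chain of
other IDs, and merges them with a small union-find. The names ROWS and
overlapping_id_sets are made up for illustration, and this is only one way to
describe the target, not something I have working in Pig.

```python
from collections import defaultdict

# Same (id, label) rows as relation A above.
ROWS = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "synonym1"),
    ("67", "synonym1"),
]


def overlapping_id_sets(rows):
    """Group IDs that are linked, directly or transitively, by a shared label."""
    parent = {}

    def find(x):
        # Walk up to the representative, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_id_for_label = {}
    for id_, label in rows:
        find(id_)  # register the ID even if none of its labels are shared
        if label in first_id_for_label:
            # Another ID already carries this label, so the two belong together.
            union(first_id_for_label[label], id_)
        else:
            first_id_for_label[label] = id_

    # Collect every ID under its representative.
    groups = defaultdict(set)
    for id_ in parent:
        groups[find(id_)].add(id_)
    return list(groups.values())


if __name__ == "__main__":
    for arbitrary_id, ids in enumerate(overlapping_id_sets(ROWS)):
        print(arbitrary_id, sorted(ids))  # prints: 0 ['12', '45', '67', '78']
```

On the sample data this yields a single set containing all four IDs, which is
the shape of output I'd like Pig to produce.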

How can I join on the bag of IDs in 'D' to find other labels that have at
least one of the same IDs?  Or am I approaching this the wrong way?

Thanks,

Mike
