register piggybank.jar;

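-- Turn the '-' separators into ',' so that TOKENIZE (which splits on commas) can break the string apart.
X1 = FOREACH X GENERATE $0 as f1,
            org.apache.pig.piggybank.evaluation.string.REPLACE($1,'-',',') as temp;
-- FLATTEN emits one output row per token.
X2 = FOREACH X1 GENERATE f1, FLATTEN(TOKENIZE(temp)) as (f2);
Y2 = FOREACH Y GENERATE $0 as f1, $1 as f2;
-- Join on the token; Y2 carries the f1/f2 aliases used below.
Joined = JOIN X2 BY f2, Y2 BY f1 PARALLEL <your-parallel-value>;
Final = FOREACH Joined GENERATE
            X2::f1 as f1,
            X2::f2 as f2,
            Y2::f2 as f3;
Dump Final;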
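
As a side note, if your Pig version has a built-in TOKENIZE that accepts a delimiter as a second argument, the REPLACE step (and the piggybank jar) may not be needed. Below is a rough, untested sketch under that assumption. The input paths and the field names (id, tags, code, label) are made up for illustration and are not from your setup. It also keeps all four columns, so the rows should have the form (1,a,a,aaa) that you asked for.

```pig
-- Hypothetical input paths and field names; adjust to your data.
-- Both files are assumed to be tab-separated (the PigStorage default).
X = LOAD 'x.tsv' AS (id:int, tags:chararray);         -- e.g. (1,a-b-c)
Y = LOAD 'y.tsv' AS (code:chararray, label:chararray); -- e.g. (a,aaa)

-- Assumption: the built-in TOKENIZE takes the delimiter as its second argument.
-- FLATTEN turns the resulting bag into one row per token.
X2 = FOREACH X GENERATE id, FLATTEN(TOKENIZE(tags, '-')) AS tag;

-- Inner join of each token against the lookup relation.
Joined = JOIN X2 BY tag, Y BY code;

-- Keep both join keys plus the looked-up value.
Final = FOREACH Joined GENERATE X2::id, X2::tag, Y::code, Y::label;
DUMP Final;
```

The order of the dumped rows is not guaranteed, so add an ORDER BY if you need them sorted.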

-...@nkur

On 11/4/10 3:12 PM, "Anze" <[email protected]> wrote:

Hi all!

I have a problem that I can't find a solution to... Hope someone can shed some
light. :)

-----
grunt> dump X;
(1,a-b-c)
(2,d-a)
(3,c)
-----
(where $1 is a chararray)

I would like to generate this relation from it:
-----
(1,a)
(1,b)
(1,c)
(2,d)
(2,a)
(3,c)
-----

Can this be done?

More background: what I would actually like to do is an inner join on two
relations, one as specified above (relation X) and another that has
'a','b','c','d'... as values:
-----
grunt> dump Y;
(a,aaa)
(b,bbb)
(c,ccc)
...
-----
So this is the end result I am looking for:
-----
(1,a,a,aaa)
(1,b,b,bbb)
(1,c,c,ccc)
(2,d,d,ddd)
(2,a,a,aaa)
(3,c,c,ccc)
-----

One idea: I could make a cross join and keep only records (by using filter +
matches) where Y.$0 is contained in X.$1. But that seems very inefficient to
me. Is there a better way?

Thanks for any pointers,

Anze
