I think there's probably some convoluted way to do this. First thing you'll
have to do is flatten your data.
data1 = A, B
_____
X, X1
X, X2
Y, Y1
Y, Y2
Y, Y3
Then join data1 by "B" onto the first column of your second dataset. This
should produce the following
joined = data1::A, data1::B, data2::A, data2::B, data2::C, data2::D (I'm
assuming the second data set has exactly 4 columns).
_______________
X, X1, X1, 4, 5, 6
X, X2, X2, 3, 7, 3
Now do a group by data1::A to get
{X, {(X, X1, X1, 4, 5, 6), (X, X2, X2, 3, 7, 3), ...}}
{Y, {(Y, Y1, Y1, ...), (Y, Y2, Y2, ...), ...}}
This is as far as I got. I'm not sure if there's a built-in UDF to
transform that into what you're looking for. I thought maybe BagToTuple,
but it will return a single tuple with all elements of all tuples in the
bag. If the above data format supports your use cases, you're done. If not,
you can write a UDF to transform it into the required format.
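For what it's worth, here is a rough, untested sketch of the steps above in
Pig Latin. The paths, aliases and field names (raw1, members, id, v1..v3) are
made up for illustration. It also assumes both files are tab-delimited and
that the members in the first data set are stored as a bag rather than a
tuple: FLATTEN on a bag produces one row per element, whereas FLATTEN on a
tuple only widens the row, so a tuple would have to be converted first.

```pig
-- Hypothetical paths and field names; adjust to the real data.
-- Assumes the first file stores members as a bag, e.g.  X<TAB>{(X1),(X2)}
raw1  = LOAD 'data1' AS (a:chararray, members:bag{t:tuple(b:chararray)});
data2 = LOAD 'data2' AS (id:chararray, v1:int, v2:int, v3:int);

-- Step 1: flatten the bag so each (a, b) pair sits on its own row.
data1 = FOREACH raw1 GENERATE a, FLATTEN(members) AS b;

-- Step 2: join the member ids onto the first column of the second data set.
-- Columns after the join: data1::a, data1::b, data2::id, data2::v1, v2, v3.
joined = JOIN data1 BY b, data2 BY id;

-- Step 3: group the joined rows back under the original key.
grouped = GROUP joined BY data1::a;

-- Optional: drop the redundant key columns inside each bag, keeping only
-- positions 2..5 (id, v1, v2, v3) of every joined tuple.
result = FOREACH grouped GENERATE group, joined.($2, $3, $4, $5);

DUMP result;
```

Each output row would then hold the key plus a bag of (id, v1, v2, v3)
tuples. That is still a bag rather than separate top-level tuples, so the
point about needing a UDF for any other layout still applies.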
On Wed, Sep 4, 2013 at 4:39 PM, F. Jerrell Schivers
<[email protected]> wrote:
Howdy folks,
Let's say I have a set of data that looks like this:
X, (X1, X2)
Y, (Y1, Y2, Y3)
So there could be an unknown number of members of each tuple per row.
I also have a second set of data that looks like this:
X1, 4, 5, 6
X2, 3, 7, 3
I'd like to join these such that I get:
X, (X1, 4, 5, 6), (X2, 3, 7, 3)
Y, (Y1, etc), (Y2, etc), (Y3, etc)
Is this possible with Pig?
Thanks,
Jerrell