You've pretty much got it actually. Don't bother trying to use an empty bag.
The only thing I've ever gotten to work in such situations is the MAX UDF as
opposed to FLATTEN. That's assuming you've only got one value per (uid,key)
tuple of course. Here's how to modify your last line:
flattened = FOREACH cogrouped {
  -- 'null' here is the literal string, used as a placeholder for a missing value
  age    = (IsEmpty(by_age)    ? 'null' : MAX(by_age.value));
  colour = (IsEmpty(by_colour) ? 'null' : MAX(by_colour.value));
  food   = (IsEmpty(by_food)   ? 'null' : MAX(by_food.value));
  GENERATE group, age, colour, food;
};
Hurray.
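A possible variant, offered as an untested sketch and not a confirmed fix: group the raw data once and filter inside the nested FOREACH, which skips the SPLIT and COGROUP steps. It assumes, as above, one value per (uid,key). It also assumes MAX returns null for an empty bag, so missing keys would come out as real nulls instead of the string 'null'. The aliases grouped, pivoted and *_rows are made up for illustration.

```pig
-- Sketch only: assumes MAX yields null on an empty bag (not verified here).
data = LOAD 'blah' AS (uid:chararray, key:chararray, value:chararray);

-- One group per uid; the bag named 'data' holds all (uid,key,value) rows for it.
grouped = GROUP data BY uid;

pivoted = FOREACH grouped {
    -- Pick out the rows for each key; a bag is empty if the key is absent.
    age_rows    = FILTER data BY key == 'age';
    colour_rows = FILTER data BY key == 'colour';
    food_rows   = FILTER data BY key == 'food';
    GENERATE group AS uid,
             MAX(age_rows.value)    AS age,
             MAX(colour_rows.value) AS colour,
             MAX(food_rows.value)   AS food;
};
```

This still needs one FILTER line per key, so it only trims the boilerplate.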
What happens with 10 keys? 100? There must be a better way. Anyone else want to
chime in?
--jacob
@thedatachef
Sent from my HTC Inspire™ 4G on AT&T
----- Reply message -----
From: "Mat Kelcey" <[email protected]>
To: <[email protected]>
Subject: trouble with syntax for flatten in a foreach
Date: Mon, Jul 11, 2011 12:47 am
hi,
i've got a pretty simple transform of data i need to do and i can't for the
life of me work it out.
i feel like i'm missing something trivial...
i want to go from this...
person key value
bob age 25
bob colour red
fred age 30
fred food bagels
to this...
person age colour food
bob 25 red null
fred 30 null bagels
here's the best i can do....
> data = load 'blah' as (uid:chararray, key:chararray, value:chararray);
-- data: {uid: chararray,key: chararray,value: chararray}
(bob,age,25)
(bob,colour,red)
(fred,age,30)
(fred,food,bagels)
> split data into
by_age if key=='age',
by_colour if key=='colour',
by_food if key=='food';
> cogrouped = cogroup by_age by uid, by_colour by uid, by_food by uid;
-- cogrouped: {group: chararray,by_age: {(uid: chararray,key: chararray,value: chararray)},by_colour: {(uid: chararray,key: chararray,value: chararray)},by_food: {(uid: chararray,key: chararray,value: chararray)}}
(bob,{(bob,age,25)},{(bob,colour,red)},{})
(fred,{(fred,age,30)},{},{(fred,food,bagels)})
> flattened = foreach cogrouped generate group as uid, by_age.value as age,
by_colour.value as colour, by_food.value as food;
-- flattened: {uid: chararray,age: {(value: chararray)},colour: {(value: chararray)},food: {(value: chararray)}}
(bob,{(25)},{(red)},{})
(fred,{(30)},{},{(bagels)})
any attempt to call flatten on the tuples, eg
> flattened = foreach cogrouped generate group as uid,
flatten(by_food.value) as food;
and i lose the entries that had an empty bag for food (eg bob in this case)
i've got a feeling isempty might get me somewhere and
> flattened = foreach cogrouped generate
group as uid,
(IsEmpty(by_food.value) ? 0 : 1);
(bob,0)
(fred,1)
but any attempt to use a real value in there fails, i can't get the syntax
correct.
> flattened = foreach cogrouped generate
group as uid,
(IsEmpty(by_food.value) ? {} : by_food.value);
not sure how to define an empty bag for the left hand side of the bin cond?
i must be missing something fundamental somewhere.
help me obi-wan kenobi, you're my only hope.
cheers,
mat