Ah, OK, that was very helpful, thanks. I've been able to flatten things out
now. Next I'm trying to re-group the 2 levels of bags that I flattened.
After the flatten and join operations I end up with data that looks like
this:
('item1',111,'thing1',222,'value1','result1')
('item1',111,'thing1',222,'value2','result2')
I need to send this data to a UDF, once per 'item', so I need to re-group 2
levels up (group values & things). With 5 somewhat inelegant steps I managed
to re-group the values correctly and project out just the 'thing' fields:
('thing1',222,{('value1','result1')})
('thing1',222,{('value2','result2')})
But now I want to take that form and turn it into something like this (so I
can re-join that to the original dataset):
'thing1', {222, {('value1', 'result1'), ('value2', 'result2')} }
I can't seem to make that happen with a GROUP operation. Grouping on the bag
gives an error that the operation isn't supported yet, and grouping on 'thing'
alone doesn't yield a useful result.
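To pin down the target shape independently of Pig syntax, the regrouping I'm after can be sketched in plain Python. This is only an illustration of the desired output, not a Pig solution; the function name regroup and the tuple/list layout are made up for the example, and the rows are the two flattened sample rows shown earlier.

```python
from collections import defaultdict

def regroup(rows):
    """Rebuild item -> things -> (value, result) nesting from flat 6-field rows."""
    # nested[(item, d)][(thing, d1)] collects the (value, result) pairs
    nested = defaultdict(lambda: defaultdict(list))
    for item, d, thing, d1, value, result in rows:
        nested[(item, d)][(thing, d1)].append((value, result))
    # One output record per item, which is what the UDF would receive
    return [
        (item, d, [(thing, d1, pairs) for (thing, d1), pairs in things.items()])
        for (item, d), things in nested.items()
    ]

rows = [
    ("item1", 111, "thing1", 222, "value1", "result1"),
    ("item1", 111, "thing1", 222, "value2", "result2"),
]
print(regroup(rows))
```

The point is that both levels are keyed on plain scalar fields, (item, d) and (thing, d1), and never on a bag itself.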
====================================
For some context, the original problem is this:
A = load 'data6' as ( item:chararray, d:int,
    things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
grunt> cat data1
'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
grunt> cat data2
'value1' 'result1'
'value2' 'result2'
We're trying to join the 'result1', 'result2' values in [data2] into the
structure in [data1]. Then we need to call a UDF once per item so it can
output the data in a specific format.
An 'item' has 0 or more 'things', a 'thing' has 0 or more 'values', and a
'value' may or may not have a 'result' (a simple OO structure with nested
collections, or 3 straightforward SQL tables, for comparison).
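Expressed as a rough Python sketch, the join described here looks like the following. The helper name attach_results, the dict lookup, and the extra 'value3' (included only to show a value with no result) are assumptions made for illustration; they are not part of the actual data or of any Pig API.

```python
def attach_results(item_record, results):
    """Pair every value with its result, or None when no result exists."""
    item, d, things = item_record
    return (
        item,
        d,
        [
            # results.get(v) yields None for a value that has no result
            (thing, d1, [(v, results.get(v)) for v in values])
            for thing, d1, values in things
        ],
    )

# 'value3' is hypothetical: it stands in for a value without a result
record = ("item1", 111, [("thing1", 222, ["value1", "value2", "value3"])])
results = {"value1": "result1", "value2": "result2"}
print(attach_results(record, results))
```

In SQL terms this corresponds to a left outer join from values to results, so values without a result are kept rather than dropped.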
-----Original Message-----
From: Russell Jurney [mailto:[email protected]]
Sent: Tuesday, June 04, 2013 6:53 PM
To: [email protected]
Subject: Re: Flattening nested bags
B = foreach A generate item, d, flatten(things);
C = foreach B generate item, d, thing, d1, flatten(values);
On Jun 4, 2013, at 5:46 PM, "David Parks" <[email protected]> wrote:
> We've been at our first real use case with pig for quite some time
> now, and still not successful. I wonder if someone can provide an
> answer to this very much simplified version of our problem:
>
> Input data:
> ---------------
> 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
>
> Load statement for above data:
> ----------------------------------------
> A = load 'data6' as ( item:chararray, d:int,
> things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
>
> Desired result:
> ------------------
> ('item1' 111 thing1 222 value1)
> ('item1' 111 thing1 222 value2)
>
> Questions:
> ----------------
> - Is there a single step I can use to flatten this? Or will it require
> doing 2 steps: first flatten 'things', and then take those results and
> flatten 'values'?
> - We're really looking for the syntax to get this right. I've posted a
> number of questions here and on Stack Overflow with lots of good
> suggestions, and read through the O'Reilly book online, none of which,
> though, have gotten me past constant errors like "Cannot find field v
> in values:bag{:tuple(v:chararray)}"
> - Should I be working on converting our data to SQL-like table formats
> rather than this more Object-Oriented format with nested collections?
>
> Pseudo-code attempt (I've tried 50+ versions of this in every form I
> can glean from examples out on the internet with no success):
> ----------------------------------------------------
> B = FOREACH A GENERATE item, d, things.thing as thing, d1,
> FLATTEN(things.values.v) as v;
>
>
>