Ah, ok, that was very helpful, thanks. I've been able to flatten things out
now. So now I'm trying to re-group 2 levels of bags that I flattened (after
doing a join).

After some flatten and join operations I end up with data that looks like
this:

        ('item1',111,'thing1',222,'value1','result1')
        ('item1',111,'thing1',222,'value2','result2')

I need to send this data to a UDF, once per 'item', so I need to re-group 2
levels up (group values & things). With 5 somewhat inelegant steps I managed
to re-group the values correctly and project out just the 'thing' fields:

        ('thing1',222,{('value1','result1')})
        ('thing1',222,{('value2','result2')})

But now I want to take that form and turn it into something like this (so I
can re-join that to the original dataset):

        'thing1', {222, {('value1', 'result1'), ('value2', 'result2')} }

I can't seem to make that happen with a GROUP operation. Grouping on the bag
gives an error that the operation isn't supported yet, and grouping on 'thing'
alone doesn't yield a useful result...
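
A sketch of one way the two-level re-group might be written, assuming the
flattened, joined rows sit in a relation called flat with fields
(item, d, thing, d1, v, r). The names flat, by_thing, per_thing, by_item and
per_item are made up for illustration. The idea is to group on all of the
'thing'-level keys at once instead of on the bag, collect the (v, r) pairs
into a bag, and then group again on the 'item'-level keys:

```pig
-- Assumption: flat has schema (item, d, thing, d1, v, r).
-- All alias names here are illustrative, not from the original script.

-- Level 1: one row per thing, with its (v, r) pairs collected into a bag.
by_thing  = GROUP flat BY (item, d, thing, d1);
per_thing = FOREACH by_thing GENERATE
                FLATTEN(group) AS (item, d, thing, d1),
                flat.(v, r)    AS values;

-- Level 2: one row per item, with its things (and their nested bags) collected.
by_item   = GROUP per_thing BY (item, d);
per_item  = FOREACH by_item GENERATE
                FLATTEN(group)                AS (item, d),
                per_thing.(thing, d1, values) AS things;
```

per_item should then have one row per item, roughly of the form
(item, d, {(thing, d1, {(v, r)})}), which is the shape a once-per-item UDF
would need.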

====================================

For some context, the original problem is this:

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray,
d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );

        grunt> cat data1
        'item1' 111     { ('thing1', 222, {('value1'),('value2')}) }
        grunt> cat data2
        'value1'        'result1'
        'value2'        'result2'

We're trying to join the 'result1', 'result2' values in [data2] into the
structure in [data1]. Then we need to call a UDF once per item so it can
output the data in a specific format. 

An 'item' has 0 or more 'things', a thing has 0 or more 'values', and a
value may or may not have a 'result' (for comparison: a simple OO structure
with nested collections, or 3 straightforward SQL tables).
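
The flatten-then-join part of this might be sketched as below. Since a value
may or may not have a result, a LEFT OUTER join is used so that values with no
match are kept (r comes back null). The alias names results, f1, f2, joined and
flat are illustrative, and the file names follow the load statements above.
One caveat: FLATTEN on an empty bag produces no rows, so an item with 0 things
(or a thing with 0 values) would drop out of this pipeline.

```pig
-- Alias names are illustrative; file names are taken from the load statements.
A       = LOAD 'data6' AS (item:chararray, d:int,
              things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})});
results = LOAD 'data7' AS (v:chararray, r:chararray);

-- Flatten one level at a time, renaming fields so later references are unambiguous.
f1 = FOREACH A  GENERATE item, d, FLATTEN(things);
f2 = FOREACH f1 GENERATE item, d, thing AS thing, d1 AS d1, FLATTEN(values) AS v;

-- Keep every value, whether or not it has a matching result.
joined = JOIN f2 BY v LEFT OUTER, results BY v;

-- Drop the relation prefixes the join adds to field names.
flat = FOREACH joined GENERATE
           f2::item  AS item,  f2::d  AS d,
           f2::thing AS thing, f2::d1 AS d1,
           f2::v     AS v,     results::r AS r;
```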



-----Original Message-----
From: Russell Jurney [mailto:[email protected]] 
Sent: Tuesday, June 04, 2013 6:53 PM
To: [email protected]
Subject: Re: Flattening nested bags

B = foreach A generate item, d, flatten(things);
C = foreach B generate item, d, thing, d1, flatten(values);

Sent from my iPhone

On Jun 4, 2013, at 5:46 PM, "David Parks" <[email protected]> wrote:

> We've been at our first real use case with pig for quite some time 
> now, and still not successful. I wonder if someone can provide an 
> answer to this very much simplified version of our problem:
> 
> Input data:
> ---------------
> 'item1' 111     { ('thing1', 222, {('value1'),('value2')}) }
> 
> Load statement for above data:
> ----------------------------------------
> A = load 'data6' as ( item:chararray, d:int, 
> things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
> 
> Desired result:
> ------------------
> ('item1'        111    thing1    222    value1)
> ('item1'        111    thing1    222    value2)
> 
> Questions:
> ----------------
> - Is there a single step I can use to flatten this? Or will it require 
> doing 2 steps: first flatten 'things', and then take those results and 
> flatten 'values'?
> - We're really looking for the syntax to get this right. I've posted a 
> number of questions here and on Stack Overflow with lots of good 
> suggestions, and read through the O'Reilly book online, none of which, 
> though, have gotten me past constant errors like "Cannot find field v 
> in values:bag{:tuple(v:chararray)}"
> - Should I be working on converting our data to SQL-like table formats 
> rather than this more Object-Oriented format with nested collections?
> 
> Pseudo-code attempt (I've tried 50+ versions of this in every form I 
> can glean from examples out on the internet with no success):
> ----------------------------------------------------
> B = FOREACH A GENERATE item, d, things.thing as thing, d1,
> FLATTEN(things.values.v) as v;
> 
> 
> 
