As far as I know, PigStorage cannot handle complex data types such as bags; it just reads delimiter-separated text. You might have to restructure your data, use a different storage function, or write a custom one. Since your data model is object-oriented, you might be able to use Avro to preserve it.
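If restructuring is an option, here is a rough sketch of the flatten, join, and group-by approach suggested earlier in the thread. I have not run it against your data. The file names data1_flat and data2_flat, the tab delimiter, and the aliases J, P, G and R are my own assumptions for illustration.

```pig
-- Assumed flat, tab-separated inputs (hypothetical files):
--   data1_flat:  a1<TAB>a2-thing1
--                a1<TAB>a2-thing2
--   data2_flat:  a2-thing1<TAB>x-value1
--                a2-thing1<TAB>x-value2
A = LOAD 'data1_flat' USING PigStorage('\t') AS (a1:chararray, a2t:chararray);
B = LOAD 'data2_flat' USING PigStorage('\t') AS (a2t:chararray, x:chararray);

-- Join on the shared key while both relations are flat.
J = JOIN A BY a2t, B BY a2t;
P = FOREACH J GENERATE A::a1 AS a1, A::a2t AS a2t, B::x AS x;

-- Regroup to rebuild the nesting: one row per (a1, a2t) with a bag of x values.
G = GROUP P BY (a1, a2t);
R = FOREACH G GENERATE FLATTEN(group) AS (a1, a2t), P.x AS xs;
DUMP R;
```

Note that this is an inner join, so any a2t with no match in data2_flat drops out. A second GROUP R BY a1 should nest the result one level further if you need the full object shape.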
On Wed, May 22, 2013 at 10:51 PM, Ho Duc Ha <hodu...@gmail.com> wrote:

> We changed the load statement to:
>
> X = load 'data3' using PigStorage() as ( a:chararray, b:bag{(c:chararray)} );
>
> But we get the same results with your statement:
>
> Y = FOREACH X GENERATE b;
> dump Y;
>
> output (of above command)
> -----------------------------------------
> ()
>
> What we really want to create is a set of the tuples in the bag b
> ('5'),('6')
>
> Another example which seems to fail to load properly is this (using ints
> instead of strings):
>
> file: data4
> -------------
> ( 3, {(5),(6)} )
>
> X1 = load 'data4' using PigStorage() as ( a:int, b:bag{(c:int)} );
> dump X1;
>
> result:
> ---------
> (,)
>
> We also tried formatting the data like this, with the extra tuple around it
> like I see in the output often, no luck:
> ((3, {(5),(6)} ))
>
> On Wed, May 22, 2013 at 11:32 PM, Sergey Goder <sergeygo...@gmail.com> wrote:
>
> > Looks like you're probably not reading the data in correctly. Perhaps you
> > need to specify the USING PigStorage() syntax and specify the correct
> > delimiter as an argument.
> >
> > Also, if you want Y to just be the bag then you can just write it as;
> >
> > Y = FOREACH X GENERATE b;
> >
> > On Wed, May 22, 2013 at 8:51 AM, Ho Duc Ha <hodu...@gmail.com> wrote:
> >
> > > Actually I think you're right, the process in map/reduce isn't so
> > > different.
> > >
> > > However, after trying to do this, we can't understand the output we see
> > > below. We expected to see only '3' in alias Z, and '5' and '6' in alias Y,
> > > neither result was as expected.
> > >
> > > X = load 'data3' as ( a:chararray, b:bag{(c:chararray)} );
> > > Y = foreach X { W = foreach b generate *; generate W; };
> > > Z = foreach X generate a;
> > >
> > > data3
> > > ( '3', {( '5' ),('6')} )
> > >
> > > dump X
> > > (( '3', {( '5' ),('6')} ),)
> > >
> > > dump Y
> > > ({})
> > >
> > > dump Z
> > > (( '3', {( '5' ),('6')} ))
> > >
> > > On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I'm a beginner pig user and this is my first post to the Pig mailing list.
> > > >
> > > > Anyway, to answer your question, the first thing that comes to my mind is
> > > > that Pig may not be able to do a complex join like that.
> > > >
> > > > However, you can first flatten the bag in A, then do your join and then do
> > > > a group by to get the result in the format you are looking for. This may
> > > > not be an ideal solution, but it should work.
> > > >
> > > > Pradeep
> > > >
> > > > On Wed, May 22, 2013 at 8:49 AM, Ho Duc Ha <hodu...@gmail.com> wrote:
> > > >
> > > > > We've got a data type that is modeled after a typical object-oriented
> > > > > data-model format (simple fields, and collections of other objects). We're
> > > > > trying to accomplish the following join:
> > > > >
> > > > > Here's our example input:
> > > > > -------------------------------------
> > > > > data1 = { ( 'a1', { ('a2-thing1'), ('a2-thing2') } ) }
> > > > > data2 = { ( 'a2-thing1', 'x-value1' ), ( 'a2-thing1', 'x-value2' ) }
> > > > >
> > > > > Here's what we want to get:
> > > > > --------------------------------------
> > > > > ( 'a1', { ('a2-thing1', {
> > > > > ('x-value1'), ('x-value2') }
> > > > > ) }
> > > > > )
> > > > >
> > > > > Notice that we are trying to join the collection of a2 fields of the 1st
> > > > > data set, on the first field in the 2nd data set.
> > > > >
> > > > > We tried this:
> > > > > --------------------
> > > > > A = load 'data1' as ( a:tuple(a1:chararray, a2:bag{(a2t:chararray)}) );
> > > > > B = load 'data2' as ( a2t:chararray, x:chararray );
> > > > > X = join A by a2.a2t, B by a2t;
> > > > >
> > > > > We get this error:
> > > > > ---------------------------
> > > > > ERROR 1128: Cannot find field a2t in
> > > > > a1:chararray,a2:bag{:tuple(a2t:chararray)}
> > > > >
> > > > > Try as we might, we cannot find the right way to do this complex join.
> > > > > Questions:
> > > > > 1) Should we be simplifying our data format into a more SQL table-like
> > > > > structure and doing more joins to reduce the complexity?
> > > > > 2) How can we accomplish joining data2's data into the data1 "objects"?
> > > > >
> > > > > --
> > > > > Ho Duc Ha
> > >
> > > --
> > > Ho Duc Ha
>
> --
> Ho Duc Ha