As far as I know, PigStorage cannot handle complex data types such as bags
(it's just a delimiter-separated file format). You might have to restructure
your data, use a different storage function, or write a custom storage
function. Since your data model is modeled after OO, you might be able to
leverage Avro to maintain it.
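
A rough sketch of the restructuring route, following the flatten / join /
group idea suggested earlier in the thread. It assumes data1 has been
rewritten as a flat, tab-separated file with one (a1, a2t) pair per line
(called 'data1_flat' here, a hypothetical name) and that data2 is
tab-separated as well. I haven't run this against your data, so treat it as
a starting point rather than a tested script:

```pig
-- Assumption: 'data1_flat' is a hypothetical flattened copy of data1,
-- one "a1<TAB>a2t" pair per line; 'data2' holds "a2t<TAB>x" per line.
A = LOAD 'data1_flat' USING PigStorage('\t') AS (a1:chararray, a2t:chararray);
B = LOAD 'data2' USING PigStorage('\t') AS (a2t:chararray, x:chararray);

-- Plain join on the scalar key instead of joining on a bag field.
J = JOIN A BY a2t, B BY a2t;

-- Keep one copy of the key and drop the relation prefixes.
P = FOREACH J GENERATE A::a1 AS a1, A::a2t AS a2t, B::x AS x;

-- First regroup: collect the x values for each (a1, a2t) pair.
G1 = GROUP P BY (a1, a2t);
N = FOREACH G1 GENERATE group.a1 AS a1, group.a2t AS a2t, P.x AS xs;

-- Second regroup: collect the (a2t, xs) tuples for each a1.
G2 = GROUP N BY a1;
R = FOREACH G2 GENERATE group AS a1, N.(a2t, xs) AS things;

DUMP R;
```

The two GROUP steps rebuild the nesting one level at a time, so the result
should have the shape (a1, {(a2t, {(x), ...}), ...}) that you described.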


On Wed, May 22, 2013 at 10:51 PM, Ho Duc Ha <hodu...@gmail.com> wrote:

> We changed the load statement to:
>
> X = load 'data3' using PigStorage() as ( a:chararray, b:bag{(c:chararray)}
> );
>
> But we get the same results with your statement:
>
> Y = FOREACH X GENERATE b;
> dump Y;
>
> output (of above command)
> -----------------------------------------
> ()
>
> What we really want to create is a set of the tuples in the bag b
> ('5'),('6')
>
> Another example which seems to fail to load properly is this (using ints
> instead of strings):
>
> file: data4
> -------------
> ( 3, {(5),(6)} )
>
> X1 = load 'data4' using PigStorage() as ( a:int, b:bag{(c:int)} );
> dump X1;
>
> result:
> ---------
> (,)
>
> We also tried formatting the data like this, with the extra tuple around it
> like I see in the output often, no luck:
> ((3, {(5),(6)} ))
>
>
>
>
> On Wed, May 22, 2013 at 11:32 PM, Sergey Goder <sergeygo...@gmail.com> wrote:
>
> > Looks like you're probably not reading the data in correctly. Perhaps you
> > need to specify the USING PigStorage() syntax and specify the correct
> > delimiter as an argument.
> >
> > Also, if you want Y to just be the bag, then you can just write it as:
> >
> > Y = FOREACH X GENERATE b;
> >
> >
> > On Wed, May 22, 2013 at 8:51 AM, Ho Duc Ha <hodu...@gmail.com> wrote:
> >
> > > Actually I think you're right, the process in map/reduce isn't so
> > > different.
> > >
> > > However, after trying to do this, we can't understand the output we see
> > > > below. We expected to see only '3' in alias Z, and '5' and '6' in alias Y;
> > > > neither result was as expected.
> > >
> > > X = load 'data3' as ( a:chararray, b:bag{(c:chararray)} );
> > > Y = foreach X { W = foreach b generate *; generate W; };
> > > Z = foreach X generate a;
> > >
> > > data3
> > > ( '3', {( '5' ),('6')} )
> > >
> > > dump X
> > > (( '3', {( '5' ),('6')} ),)
> > >
> > > dump Y
> > > ({})
> > >
> > > dump Z
> > > (( '3', {( '5' ),('6')} ))
> > >
> > >
> > >
> > >
> > > > On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > > I'm a beginner Pig user and this is my first post to the Pig mailing list.
> > > > >
> > > > > Anyway, to answer your question, the first thing that comes to my mind is
> > > > > that Pig may not be able to do a complex join like that.
> > > > >
> > > > > However, you can first flatten the bag in A, then do your join, and then do
> > > > > a group by to get the result in the format you are looking for. This may
> > > > > not be an ideal solution, but it should work.
> > > >
> > > > Pradeep
> > > >
> > > >
> > > > > On Wed, May 22, 2013 at 8:49 AM, Ho Duc Ha <hodu...@gmail.com> wrote:
> > > >
> > > > > > We've got a data type that is modeled after a typical object-oriented
> > > > > > data-model format (simple fields, and collections of other objects). We're
> > > > > > trying to accomplish the following join:
> > > > > >
> > > > > > Here's our example input:
> > > > > -------------------------------------
> > > > > data1 = {  ( 'a1', { ('a2-thing1'), ('a2-thing2') } )  }
> > > > > > data2 = {  ( 'a2-thing1', 'x-value1' ), ( 'a2-thing1', 'x-value2' )  }
> > > > >
> > > > > Here's what we want to get:
> > > > > --------------------------------------
> > > > > ( 'a1', { ('a2-thing1', {
> > > > > ('x-value1'), ('x-value2') }
> > > > > ) }
> > > > > )
> > > > >
> > > > > > Notice that we are trying to join the collection of a2 fields of the 1st
> > > > > > data set, on the first field in the 2nd data set.
> > > > >
> > > > > We tried this:
> > > > > --------------------
> > > > > > A = load 'data1' as ( a:tuple(a1:chararray, a2:bag{(a2t:chararray)}) );
> > > > > B = load 'data2' as ( a2t:chararray, x:chararray );
> > > > > X = join A by a2.a2t, B by a2t;
> > > > >
> > > > > We get this error:
> > > > > ---------------------------
> > > > > ERROR 1128: Cannot find field a2t in
> > > > > a1:chararray,a2:bag{:tuple(a2t:chararray)}
> > > > >
> > > > > > Try as we might, we cannot find the right way to do this complex join.
> > > > > > Questions:
> > > > > >   1) Should we be simplifying our data format into a more SQL table-like
> > > > > > structure and doing more joins to reduce the complexity?
> > > > > >   2) How can we accomplish joining data2's data into the data1 "objects"?
> > > > >
> > > > > --
> > > > > Ho Duc Ha
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Ho Duc Ha
> > >
> >
>
>
>
> --
> Ho Duc Ha
>
