Re: why the foreach nested form can't work?

Daniel Dai Wed, 20 Jul 2011 11:00:00 -0700

If you refer to a field in the base relation, you only need to use the
column name:
B = FILTER A BY score>80;


Here A is the base relation, so you only need to say "score" instead of
"A.score". Otherwise, Pig will think you are using A as a scalar
(http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars).
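
To put the cases discussed in this thread side by side, here is a rough sketch (not tested against any particular Pig release). The file name 'student.txt' and the aliases G, S and H are only assumptions for illustration; the schema is the one used elsewhere in the thread.

```pig
-- Assumed input: tab-separated (name, no, score), as in the thread.
A = LOAD 'student.txt' AS (name:chararray, no:int, score:int);

-- Top level: A is the relation being filtered, so use the bare field name.
B = FILTER A BY score > 80;

-- After GROUP, A is a bag inside each tuple of G,
-- so A.score projects a bag holding just the scores of that group.
G = GROUP A BY no;
S = FOREACH G GENERATE group, A.score;

-- Nested FOREACH: the inner FILTER walks the tuples of bag A,
-- so the condition again uses the bare field name.
-- FLATTEN un-nests the result and drops groups whose bag is empty.
H = FOREACH G {
    D = FILTER A BY score > 80;
    GENERATE group, FLATTEN(D.(name, score));
};
```

In short, the dot is for projecting fields out of a bag (or tuple) that sits inside the current row; it is not needed when the field belongs to the relation or bag the operator is iterating over.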

Daniel

2011/7/20 勇胡 <[email protected]>

> Thanks for your response. Now I am wondering in which kind of situation I
> can use "." to reference a field. In Pig, if I understand correctly, each
> relation is a bag. If I issue these commands:
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);
> B = FILTER A BY A.score>80;
>
> There is no problem at compile time and the Pig code executes, but in the
> end I get wrong results. As you mentioned, A.score is a bag and 80 is a
> constant, so they are not compatible. This is really a big difference
> from SQL. If I use:
>
> B = FILTER A BY score>80; there is no problem, and the statement executes
> the filter semantics.
>
> The same problem occurs with the operators "group, cogroup, join, split,
> order, cross". The input of these operators only supports fields, not bags
> (if I use "." to reference the field, I get wrong output). If these
> ordinary operators cannot support "bag" operations, I can't see why Pig
> needs a bag type, as the operators can only support flattened types.
>
> Regards!
>
> Yong
> 2011/7/19 Jacob Perkins <[email protected]>
>
> > On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
> > > How can I understand that 'A.score' is a bag? I mean that if I issue a
> > > 'describe B' command, I can get B: {group:int, A: {name:chararray,
> > > no:int,score:int}}.
> > Looking at the output of describe shows that A is a bag (e.g. the '{' and
> > '}' characters), yes? So 'A.score' is simply the bag of all the scores
> > in the group. You can go further and get a bag of both the scores and
> > numbers by looking at 'A.(no, score)'. I admit that it _is_ confusing at
> > first.
> >
> > > From here, I can't get any information that 'A.score' is
> > > a bag, but I can see that A.score is an element of bag.
> > Not true. 'score' is the name of the field. 'A.score' is a bag of just
> > the scores. Using the dot '.' is a way of pulling out specific fields
> > from every tuple within a bag to result in another bag. Consider:
> >
> > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > B = GROUP A BY no;
> > DUMP B;
> >
> > (1,{(henrietta,1,25),(sally,1,82)})
> > (3,{(fred,3,120)})
> > (4,{(elsie,4,45)})
> >
> > C = FOREACH B GENERATE A.score;
> > DUMP C;
> >
> > ({(25),(82)})
> > ({(120)})
> > ({(45)})
> >
> > Got it?
> >
> > > And why does it work if I delete the qualifier 'A.'?
> > >
> > > I just changed my pig code as
> > >
> > > A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);
> > > B = GROUP A BY no;
> > > C =  FOREACH B {
> > >     D = FILTER A BY score > 80;
> > >     GENERATE D.name, D.score;}
> > > DUMP C;
> > >
> > > I got an empty bag!
> > 'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
> > at the end, as in the example.
> >
> > >
> > > The input is as:
> > > henrietta       1       25
> > > sally   1       82
> > > fred    3       120
> > > elsie   4       45
> > >
> > > The output is as:
> > > ({(sally)},{(82)})
> > > ({(fred)},{(120)})
> > > ({},{})
> > >
> > > As you see, I got an empty tuple. Why?
> > There are three tuples, one for each group (1, 3, and 4). The filter
> > condition left the bags from group 4 empty since the only tuple,
> > (elsie,4,45) did not have a score > 80. If you FLATTEN the bags the
> > empty ones are discarded.
> >
> > --jacob
> > @thedatachef
> >
> > >
> > > Yong
> > >
> > > 2011/7/19 Jacob Perkins <[email protected]>
> > >
> > > > I think it's because 'A.score' is a bag but Pig needs a reference to a
> > > > field in the tuples. This worked for me:
> > > >
> > > > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > > > B = GROUP A BY no;
> > > > C = FOREACH B {
> > > >       D = FILTER A BY score > 80;
> > > >       GENERATE FLATTEN(D.(name, score));
> > > >     };
> > > > DUMP C;
> > > >
> > > > on the following data:
> > > >
> > > > $: cat foo.tsv
> > > > henrietta       1       25
> > > > sally   1       82
> > > > fred    3       120
> > > > elsie   4       45
> > > >
> > > > yields:
> > > >
> > > >
> > > > Does that work for you?
> > > >
> > > > --jacob
> > > > @thedatachef
> > > >
> > > > On Tue, 2011-07-19 at 15:00 +0200, 勇胡 wrote:
> > > > > A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score:int);
> > > > > B = GROUP A BY no;
> > > > > C =  FOREACH B {
> > > > >     D = FILTER A BY A.score > 80;
> > > > >     GENERATE D.name, D.score;}
> > > > > DUMP C;
> > > >
> > > >
> >
> >
> >
>
