Re: why the foreach nested form can't work?

勇胡 Wed, 20 Jul 2011 02:34:47 -0700

Thanks for your response. Now I am wondering in which situations I can use
"." to reference a field. In Pig, if I understand correctly, each
relation is a bag. If I issue these commands:


A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);
B = FILTER A BY A.score>80;

There is no problem at compile time and the Pig code executes, but in the
end I don't get the expected results. As you mentioned, A.score is a bag and
80 is a constant, so they are not compatible. This is really a big difference
from SQL. If I use:

B = FILTER A BY score>80;

there is no problem, and the statement performs the filter as expected.
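
To make the contrast concrete, here is a minimal sketch using the same file
and schema as above. The aliases G and S are names made up for this
illustration, and the comments reflect my reading of the replies quoted
below rather than anything tested here.

```pig
A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);

-- At the top level, name the field directly (no "A." prefix).
B = FILTER A BY score > 80;

-- After GROUP, each tuple of G carries a bag named A,
-- so here A.score is the bag of scores for that group.
G = GROUP A BY no;
S = FOREACH G GENERATE group, A.score;
DUMP S;
```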

The same problem occurs with the operators "group, cogroup, join, split,
order, cross". The input of these operators only supports fields, not bags
(if I use "." to reference the field, I get wrong output). If these
ordinary operators cannot support bag operations, I can't see why Pig
needs the bag type, as the operators only support flat types.
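
One place where a bag projection does seem to be consumed is the built-in
aggregate functions, such as AVG and COUNT, which take a bag as their
argument. A minimal sketch, assuming the same schema as above; the aliases
G and R and the output field names are hypothetical.

```pig
A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);
G = GROUP A BY no;

-- AVG receives the bag A.score; COUNT receives the whole bag A.
R = FOREACH G GENERATE group, AVG(A.score) AS avg_score, COUNT(A) AS n;
DUMP R;
```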

Regards!

Yong
2011/7/19 Jacob Perkins <[email protected]>

> On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
> > How can I understand that 'A.score' is a bag? I mean that if I issue a
> > 'describe B' command, I can get B: {group:int, A: {name:chararray,
> > no:int,score:int}}.
> Looking at the output of describe shows that A is a bag (e.g. the '{' and
> '}' characters), yes? So 'A.score' is simply the bag of all the scores
> in the group. You can go further and get a bag of both the scores and
> numbers by looking at 'A.(no, score)'. I admit that it _is_ confusing at
> first.
>
> > From here, I can't get any information that 'A.score' is
> > a bag, but I can see that A.score is an element of bag.
> Not true. 'score' is the name of the field. 'A.score' is a bag of just
> the scores. Using the dot '.' is a way of pulling out specific fields
> from every tuple within a bag to result in another bag. Consider:
>
> A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> B = GROUP A BY no;
> DUMP B;
>
> (1,{(henrietta,1,25),(sally,1,82)})
> (3,{(fred,3,120)})
> (4,{(elsie,4,45)})
>
> C = FOREACH B GENERATE A.score;
> DUMP C;
>
> ({(25),(82)})
> ({(120)})
> ({(45)})
>
> Got it?
>
> > And why if I delete the quantifier 'A.', it works?
> >
> > I just changed my pig code as
> >
> > A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);
> > B = GROUP A BY no;
> > C =  FOREACH B {
> >     D = FILTER A BY score > 80;
> >     GENERATE D.name, D.score;}
> > DUMP C;
> >
> > I got an empty bag!
> 'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
> at the end, as in the example.
>
> >
> > The input is as:
> > henrietta       1       25
> > sally   1       82
> > fred    3       120
> > elsie   4       45
> >
> > The output is as:
> > ({(sally)},{(82)})
> > ({(fred)},{(120)})
> > ({},{})
> >
> > As you see, I got an empty tuple? why?
> There are three tuples, one for each group (1, 3, and 4). The filter
> condition left the bags from group 4 empty since the only tuple,
> (elsie,4,45) did not have a score > 80. If you FLATTEN the bags the
> empty ones are discarded.
>
> --jacob
> @thedatachef
>
> >
> > Yong
> >
> > 2011/7/19 Jacob Perkins <[email protected]>
> >
> > > I think it's because 'A.score' is a bag but Pig needs a reference to a
> > > field in the tuples. This worked for me:
> > >
> > > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > > B = GROUP A BY no;
> > > C = FOREACH B {
> > >       D = FILTER A BY score > 80;
> > >      GENERATE FLATTEN(D.(name, score));
> > >    };
> > > DUMP C;
> > >
> > > on the following data:
> > >
> > > $: cat foo.tsv
> > > henrietta       1       25
> > > sally   1       82
> > > fred    3       120
> > > elsie   4       45
> > >
> > > yields:
> > >
> > >
> > > Does that work for you?
> > >
> > > --jacob
> > > @thedatachef
> > >
> > > On Tue, 2011-07-19 at 15:00 +0200, 勇胡 wrote:
> > > > A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score:int);
> > > > B = GROUP A BY no;
> > > > C =  FOREACH B {
> > > >     D = FILTER A BY A.score > 80;
> > > >     GENERATE D.name, D.score;}
> > > > DUMP C;
> > >
> > >
>
>
>
