If you refer to a field in the base relation, you only need to use the column name:

B = FILTER A BY score>80;

Here A is the base relation, so you only need to say "score" instead of "A.score". Otherwise, Pig will think you are using A as a scalar (http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars).

Daniel

2011/7/20 勇胡 <[email protected]>

> Thanks for your response. Now I just think that in which kind of situation I
> can use "." to reference the field. In pig, if I understand right, each
> relation is a bag. If I issue these commands:
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
> B = FILTER A BY A.score>80;
>
> There is no problem at compile time and the pig code can execute, but
> finally I can't get error results. As you mentioned, A.score is a bag and 80
> is a constant, they are not compatible. There are really big differences
> than SQL. If I use:
>
> B = FILTER A BY score>80;
>
> there is no problem, the statement can execute the filter semantics.
>
> The same problem will occur in the operators "group, cogroup, join, split,
> order, cross". The input of these operators only support fields, not bags
> (if I use "." to reference the field, I get wrong output information). If
> these normal operators can not support "bag" operations, I can't see why the
> pig needs bag type, as the operators can only support flatten type.
>
> Regards!
>
> Yong
>
> 2011/7/19 Jacob Perkins <[email protected]>
>
> > On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
> > > How can I understand that 'A.score' is a bag? I mean that if I issue a
> > > 'describe B' command, I can get B: {group:int, A: {name:chararray,
> > > no:int,score:int}}.
> > Looking at the output of describe shows that A is bag (eg. the '{' and
> > '}' characters), yes? So 'A.score' is simply the bag of all the scores
> > in the group. You can go further and get a bag of both the scores and
> > numbers by looking at 'A.(no, score)'. I admit that it _is_ confusing at
> > first.
> >
> > > From here, I can't get any information that 'A.score' is
> > > a bag, but I can see that A.score is an element of bag.
> > Not true. 'score' is the name of the field. 'A.score' is a bag of just
> > the scores. Using the dot '.' is a way of pulling out specific fields
> > from every tuple within a bag to result in another bag. Consider:
> >
> > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > B = GROUP A BY no;
> > DUMP B;
> >
> > (1,{(henrietta,1,25),(sally,1,82)})
> > (3,{(fred,3,120)})
> > (4,{(elsie,4,45)})
> >
> > C = FOREACH B GENERATE A.score;
> > DUMP C;
> >
> > ({(25),(82)})
> > ({(120)})
> > ({(45)})
> >
> > Got it?
> >
> > > And why if I delete the quantifier 'A.', it works?
> > >
> > > I just changed my pig code as
> > >
> > > A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
> > > B = GROUP A BY no;
> > > C = FOREACH B {
> > >     D = FILTER A BY score > 80;
> > >     GENERATE D.name, D.score;}
> > > DUMP C;
> > >
> > > I got an empty bag!
> > 'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
> > at the end as in the example
> >
> > > The input is as:
> > > henrietta 1 25
> > > sally 1 82
> > > fred 3 120
> > > elsie 4 45
> > >
> > > The output is as:
> > > ({(sally)},{(82)})
> > > ({(fred)},{(120)})
> > > ({},{})
> > >
> > > As you see, I got an empty tuple? why?
> > There are three tuples, one for each group (1, 3, and 4). The filter
> > condition left the bags from group 4 empty since the only tuple,
> > (elsie,4,45) did not have a score > 80. If you FLATTEN the bags the
> > empty ones are discarded.
> >
> > --jacob
> > @thedatachef
> >
> > > Yong
> > >
> > > 2011/7/19 Jacob Perkins <[email protected]>
> > >
> > > > I think it's because 'A.score' is a bag but Pig needs a reference to a
> > > > field in the tuples.
> > > > This worked for me:
> > > >
> > > > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > > > B = GROUP A BY no;
> > > > C = FOREACH B {
> > > >     D = FILTER A BY score > 80;
> > > >     GENERATE FLATTEN(D.(name, score));
> > > > };
> > > > DUMP C;
> > > >
> > > > on the following data:
> > > >
> > > > $: cat foo.tsv
> > > > henrietta 1 25
> > > > sally 1 82
> > > > fred 3 120
> > > > elsie 4 45
> > > >
> > > > yields:
> > > >
> > > >
> > > > Does that work for you?
> > > >
> > > > --jacob
> > > > @thedatachef
> > > >
> > > > On Tue, 2011-07-19 at 15:00 +0200, 勇胡 wrote:
> > > > > A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score: int);
> > > > > B = GROUP A BY no;
> > > > > C = FOREACH B {
> > > > >     D = FILTER A BY A.score > 80;
> > > > >     GENERATE D.name, D.score;}
> > > > > DUMP C;
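
To pull the answers in this thread together, here is a minimal Pig Latin sketch that puts the three ways of referencing a field side by side. It assumes the tab-separated foo.tsv from Jacob's message; the aliases High, G, Scores and Passed are made up for illustration and do not appear in the original messages. Treat it as a sketch of what was discussed above, not as a definitive recipe.

```pig
-- Assumption: foo.tsv is the four-row, tab-separated file quoted in the thread.
A = LOAD 'foo.tsv' AS (name:chararray, no:int, score:int);

-- 1. Top level: A is the relation being filtered, so use the bare field name.
--    Per Daniel's reply, writing A.score here would be read as a scalar reference.
High = FILTER A BY score > 80;

-- 2. After GROUP, every tuple of G carries a bag named A.
G = GROUP A BY no;

--    A.score projects the score field out of each group's bag; the result is a bag.
Scores = FOREACH G GENERATE group, A.score;

-- 3. Inside a nested FOREACH, filter the bag by bare field name, then FLATTEN
--    the projected bag so each surviving tuple becomes its own output row.
Passed = FOREACH G {
    D = FILTER A BY score > 80;
    GENERATE FLATTEN(D.(name, score));
};

DUMP Passed;
```

As Jacob notes, groups whose filtered bag is empty (group 4 in the sample data) should drop out once the bag is flattened.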
