Hi Uri,
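Try this. It flattens each group together with its SUM(data.num2), then
applies the bincond (note the parentheses around it) in a second foreach.
It assumes num2 is never negative, so a positive sum means at least one
positive value; if negatives are possible, testing MAX(data.num2) > 0
would be safer.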
data = load 'test.txt' using PigStorage(' ')
    as (cid:chararray, iid:chararray, num1:int, num2:int);
grouped = group data by cid;
results = foreach grouped generate FLATTEN(data), SUM(data.num2) as sum;
appended = foreach results generate cid, iid, num1, num2,
    (sum > 0 ? num1 : 0) as num3;
dump appended;
With your sample data, this will give you:
(a,e,11,0,0)
(b,f,2,2,2)
(c,g,3,3,3)
(c,h,44,44,44)
(c,i,75,0,75)
(d,j,89,0,0)
(d,k,120,0,0)
(d,l,3000,0,0)
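
If it helps to sanity-check the logic outside Pig, here is a rough Python
sketch of the same idea: sum num2 per cid, then emit every row with num1
or 0 appended. The function name append_num3 is made up for illustration,
and the rows are copied from your sample data. Unlike Pig, it keeps the
input order.

```python
from collections import defaultdict

# Sample rows from this thread: (cid, iid, num1, num2).
rows = [
    ("a", "e", 11, 0),
    ("b", "f", 2, 2),
    ("c", "g", 3, 3),
    ("c", "h", 44, 44),
    ("c", "i", 75, 0),
    ("d", "j", 89, 0),
    ("d", "k", 120, 0),
    ("d", "l", 3000, 0),
]


def append_num3(rows):
    """Hypothetical helper mirroring the group / SUM / bincond steps."""
    # First pass: per-group sum of num2 (the SUM(data.num2) step).
    sums = defaultdict(int)
    for cid, _iid, _num1, num2 in rows:
        sums[cid] += num2
    # Second pass: append num1 if the group's sum is positive, else 0.
    return [
        (cid, iid, num1, num2, num1 if sums[cid] > 0 else 0)
        for cid, iid, num1, num2 in rows
    ]


appended = append_num3(rows)
for row in appended:
    print(row)
```

The two passes correspond to the two foreach statements above: the first
computes one value per group, the second attaches it to every row.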
Thanks,
Cheolsoo
On Tue, Jan 22, 2013 at 5:17 PM, Uri Laserson wrote:
> I have data that looks like this:
>
> a e 11 0
> b f 2 2
> c g 3 3
> c h 44 44
> c i 75 0
> d j 89 0
> d k 120 0
> d l 3000 0
>
> and I load it like so:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
>
> I want to group by the first column, cid. For each group, if any of the
> num2 values (last column) are positive, I want to output every tuple in
> that group with an extra field equal to num1. If all the num2 values for
> that group are zero, then I want to output every tuple in that group with
> an extra field equal to 0.
>
> I figured something like this would work:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
> grouped = group data by cid;
> results = foreach grouped {
> result1 = SUM(data.num2);
> extended = foreach data generate *, result1 > 0 ? num1 : 0;
> generate FLATTEN(extended);
> };
>
> but it does not. I get this error:
>
> 2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 98, column 48> mismatched input '>' expecting SEMI_COLON
>
> What is the proper way to do this? From the MapReduce perspective, I group
> by the key, and in the reducer, I compute a value for each group, and then
> emit every single value for that group along with some extra data.
>
> Thanks!
> Uri