Hi Steven,
You can GROUP BY the address fields and then apply a dense RANK to assign
new key ids. Consider the following:
A = LOAD '/input' USING PigStorage('\t', '-noschema')
    AS (k:long, address01:chararray, address02:chararray,
        city:chararray, state:chararray);
B = GROUP A BY (address01, address02, city, state);
C = FOREACH B GENERATE
    FLATTEN(group) AS (address01, address02, city, state),
    A.(k) AS key_bag:bag{key_tuple:tuple(k)};
-- use 'dense' here to handle non-uniqueness issues
D = RANK C BY state DESC, city DESC, address01 DESC, address02 DESC DENSE;
E = FOREACH D GENERATE
    100000L * (long)rank_C AS new_key:long,
    address01, address02, city, state, key_bag;
grunt> DESCRIBE E;
E: {new_key: long,address01: chararray,address02: chararray,city: chararray,state: chararray,key_bag: {key_tuple: (k: long)}}
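If you only need the list of ids per address and not the new keys, the GROUP
plus a bag projection is probably enough on its own. A minimal sketch, assuming
the same relation A loaded above (the alias ids_by_address and the field names
address and ids are just illustrative, not anything Pig requires):

```pig
-- Collect rows that share the same address fields into one group
B = GROUP A BY (address01, address02, city, state);

-- Keep the address tuple as-is and project only the original ids from the bag
ids_by_address = FOREACH B GENERATE group AS address, A.k AS ids;

DUMP ids_by_address;
```

For your two sample rows this should give a single record holding the address
tuple followed by a bag of both ids, which is roughly the shape you described.
The order of ids inside the bag is not guaranteed.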
Hope this helps,
-Dan
On Mon, Apr 28, 2014 at 1:52 PM, Steven E. Waldren <[email protected]> wrote:
> I am trying to Group a relation and then create a list of values from a
> field in the relation.
>
> input:
> (100001),(500 W 1st), (suite 500), (albany), (new york)
> (100002),(500 W 1st), (suite 500), (albany), (new york)
>
> desired output would be something like:
>
> ((500 W 1st),(suite 500), (albany),(new york)), {(100001),(100002)}
>
>
> I want to create a list of ids (100001, 100002) for each unique address.
>
> I cannot seem to find any examples on the Web and cannot seem to correctly
> use DataFu’s AppendToBag.
>
> Thanks,
> Steven