Dan, Thanks so much. Was able to get your code to work.
Now I just have to educate myself on how exactly it works :-) But now I know where to look.

Best,
Steven

On Apr 28, 2014, at 1:16 PM, Dan DeCapria, CivicScience <[email protected]> wrote:

> Hi Steven,
>
> You can use a Group By on the address information, and then perform a Dense
> Rank to get new key ids. Consider the following:
>
> A = LOAD '/input' USING PigStorage('\t', '-noschema') AS (k:long,
> address01:chararray, address02:chararray, city:chararray, state:chararray);
> B = GROUP A BY (address01, address02, city, state);
> C = FOREACH B GENERATE FLATTEN(group) AS (address01, address02, city,
> state), A.(k) AS key_bag:bag{key_tuple:tuple(k)};
> D = RANK C BY state DESC, city DESC, address01 DESC, address02 DESC DENSE;
> -- use 'dense' here to handle non-uniqueness issues
> E = FOREACH D GENERATE 100000L * (long)rank_C AS new_key:long, address01,
> address02, city, state, key_bag;
>
> grunt> DESCRIBE E;
> E: {new_key: long,address01: chararray,address02: chararray,city:
> chararray,state: chararray,key_bag: {key_tuple: (k: long)}}
>
> Hope this helps,
>
> -Dan
>
>
> On Mon, Apr 28, 2014 at 1:52 PM, Steven E. Waldren <[email protected]> wrote:
>
>> I am trying to Group a relation and then create a list of values from a
>> field in the relation.
>>
>> input:
>> (100001),(500 W 1st), (suite 500), (albany), (new york)
>> (100002),(500 W 1st), (suite 500), (albany), (new york)
>>
>> desired output would be something like:
>>
>> ((500 W 1st),(suite 500), (albany),(new york)), {(100001),(100002)}
>>
>>
>> I want to create a list of ids (100001, 100002) for each unique address.
>>
>> I cannot seem to find any examples on the Web and cannot seem to correctly
>> use DataFu’s AppendToBag.
>>
>> Thanks,
>> Steven
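
For reference, the grouping step on its own appears to be enough to produce the list of ids per address asked for in the original question. The dense rank in Dan's script only adds a new key on top of that. Below is a minimal sketch, assuming the same tab-separated input and field names as Dan's script. The '/input' path is the placeholder from that script, and the alias names `address` and `key_bag` are chosen here for illustration.

```pig
-- Load the raw rows; the schema mirrors the one in Dan's script.
A = LOAD '/input' USING PigStorage('\t') AS (k:long,
    address01:chararray, address02:chararray, city:chararray, state:chararray);

-- One group per unique (address01, address02, city, state) combination.
B = GROUP A BY (address01, address02, city, state);

-- Keep the group key as a single tuple and project only the ids
-- out of the grouped bag; A.k yields a bag of one-field tuples.
C = FOREACH B GENERATE group AS address, A.k AS key_bag;

DUMP C;
```

Each output row should pair one address tuple with a bag of the ids that share it. Bags in Pig are unordered, so the order of ids inside `key_bag` is not guaranteed.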
