Sarath,

First, a quick note about unique ID generation. When your job is distributed over a cluster, there is a chance of collision. The static incrementer is not shared across mappers, so if two mappers generate a key in the same millisecond while their counters hold the same value, both tuples get the same UID. The probability is low, but if your requirement is absolutely no collisions, I would incorporate the mapper ID, or some other piece of unique, map-specific information, into the key.
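A minimal sketch of what I mean, in plain Java with no Pig classes. The class name `TaskScopedKeyGenerator` and the task ID strings are hypothetical, and this is not your actual UDF. I am assuming the caller can hand the generator some string that is unique per map task (for example, something read from the job configuration); how you obtain it inside a UDF is left out here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: builds keys of the form <taskId>-<millis>-<sequence>.
// The taskId prefix is what keeps keys from different mappers apart, since
// each mapper JVM has its own counter starting from zero.
final class TaskScopedKeyGenerator {
    private final String taskId;                          // assumed unique per map task
    private final AtomicLong sequence = new AtomicLong(); // per-JVM, not shared across mappers

    TaskScopedKeyGenerator(String taskId) {
        this.taskId = taskId;
    }

    String nextKey() {
        // Within one task the increasing sequence number alone prevents repeats;
        // the timestamp is kept only to match the original key layout.
        return taskId + "-" + System.currentTimeMillis() + "-" + sequence.getAndIncrement();
    }
}
```

Two generators built with different task IDs can produce the same millisecond and the same sequence number and still return different keys, because the prefix differs. Note that this only addresses collisions; it does not make the key stable if the pipeline is executed more than once.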
As for why it is generating different keys: each of the dumps launches a new set of M/R jobs that reruns the entire pipeline, so the UDF is invoked again at steps 8 and 10 and produces new keys each time. If you don't want that to be the case, then you should use stores instead.

2012/4/9 Sarath <[email protected]>

> Hi All,
>
> I need to generate a unique key for each grouped tuple and then store it
> along with each tuple.
> For this I have created a UDF which generates a key (current time in
> milliseconds appended with a static incrementing sequence number)
> I used it in the script as below -
>
> /1. a = load '1.txt' using PigStorage(',') as (id: chararray, name:
> chararray, age: int);
> 2. b = load '2.txt' using PigStorage(',') as (id: chararray, name:
> chararray, desg: chararray);
> 3. c = cogroup a by (ide, name), b by (id, name);
> 4. d = filter c by not IsEmpty(a) and not IsEmpty(b);
> 5. e = foreach d generate myudf.KeyGenerator(*), *;
> 6. dump e;
> 7. f = foreach e generate $0, flatten(a);
> 8. dump f;
> 9. g = foreach e generate $0, flatten(b);
> 10.dump g;/
>
> At step 6, I could see the unique key generated and printed.
> But when it comes to step 8 & 10, the unique key printed is different to
> what is generated at step 6 even though I'm carrying the same key to these
> steps in the script.
>
> What is going wrong? How can I achieve this requirement?
>
> Regards,
> Sarath.
>
