Sarath,

First, a quick note about unique ID generation. When your job is distributed
over a cluster, there is a chance of collision. A static incrementer lives
in each task's JVM and is not shared across the mappers, so if two tuples in
different tasks are given the same sequence number in the same millisecond,
they get the same UID. The probability is low, but if your requirement is
absolutely no collisions, I would incorporate the mapper ID, or some other
piece of unique, map-specific information, into the key.
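
A minimal sketch of what I mean, in plain Java so it runs without Pig on the
classpath. The class name UniqueKeyGenerator, the key layout
<taskId>-<millis>-<sequence>, and the sample task IDs are my own assumptions
for illustration, not anything from your UDF. How you obtain the task ID
inside the real UDF (for example from the Hadoop job configuration) is left
out here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Pig: prefixes each key with a
// task-specific ID so that two tasks can never emit the same key,
// even if they hit the same millisecond and sequence number.
final class UniqueKeyGenerator {
    private final String taskId;                          // assumed unique per task
    private final AtomicLong sequence = new AtomicLong(); // per-JVM counter

    UniqueKeyGenerator(String taskId) {
        this.taskId = taskId;
    }

    // Key based on the current wall-clock time.
    String nextKey() {
        return nextKey(System.currentTimeMillis());
    }

    // Overload with an explicit timestamp, which makes the collision
    // case easy to reproduce.
    String nextKey(long millis) {
        return taskId + "-" + millis + "-" + sequence.getAndIncrement();
    }

    public static void main(String[] args) {
        // Two "tasks" generating a key in the same millisecond with the
        // same sequence number: the keys differ only by the task prefix.
        UniqueKeyGenerator task0 = new UniqueKeyGenerator("attempt_m_000000_0");
        UniqueKeyGenerator task1 = new UniqueKeyGenerator("attempt_m_000001_0");
        System.out.println(task0.nextKey(1000L));
        System.out.println(task1.nextKey(1000L));
    }
}
```

Without the prefix, both calls in main would produce the same "1000-0" key,
which is exactly the collision described above.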

As for why it is generating different keys, it's because each of the dumps
launches a new set of m/r jobs for the entire pipeline, so the UDF is re-run
and produces fresh keys every time. If you don't want that to be the case,
then you should use stores instead.

2012/4/9 Sarath <[email protected]>

> Hi All,
>
> I need to generate a unique key for each grouped tuple and then store it
> along with each tuple.
> For this I have created a UDF which generates a key (current time in
> milliseconds appended with a static incrementing sequence number)
> I used it in the script as below -
>
> 1.  a = load '1.txt' using PigStorage(',') as (id: chararray, name:
> chararray, age: int);
> 2.  b = load '2.txt' using PigStorage(',') as (id: chararray, name:
> chararray, desg: chararray);
> 3.  c = cogroup a by (id, name), b by (id, name);
> 4.  d = filter c by not IsEmpty(a) and not IsEmpty(b);
> 5.  e = foreach d generate myudf.KeyGenerator(*), *;
> 6.  dump e;
> 7.  f = foreach e generate $0, flatten(a);
> 8.  dump f;
> 9.  g = foreach e generate $0, flatten(b);
> 10. dump g;
>
> At step 6, I could see the unique key generated and printed.
> But when it comes to step 8 & 10, the unique key printed is different to
> what is generated at step 6 even though I'm carrying the same key to these
> steps in the script.
>
> What is going wrong? How can I achieve this requirement?
>
> Regards,
> Sarath.
>
