Hi Srinivas,

You can write a user-defined function (UDF) for this.
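As a rough sketch of the selection logic such a function needs, here it is in plain Python. The name latest_trade is made up for illustration, and the MM/DD/YYYY date format is an assumption taken from your sample rows. A real Pig UDF would also need to be registered with Pig and declare its output schema, which this sketch leaves out.

```python
from datetime import datetime

# Assumed from the sample rows, e.g. 06/21/2012.
DATE_FORMAT = "%m/%d/%Y"


def latest_trade(records):
    """Return the record with the most recent trade-add-date.

    `records` is an iterable of (trade_key, trade_add_date, trade_price)
    tuples that all share one trade key, like the bag Pig hands to a UDF
    for each group.
    """
    # Parse the date before comparing: MM/DD/YYYY strings do not sort
    # chronologically across years when compared as plain text.
    return max(records, key=lambda rec: datetime.strptime(rec[1], DATE_FORMAT))


if __name__ == "__main__":
    k1_group = [("k1", "05/21/2012", "2000"), ("k1", "06/21/2012", "3000")]
    print(latest_trade(k1_group))  # ('k1', '06/21/2012', '3000')
```

If two records for the same key carry the same date, max() returns the first one it sees, so a tie-breaking rule would have to be added if that case matters.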
The Pig script would look something like this (with underscores in the field names, since Pig identifiers cannot contain hyphens):

    -- combine the history file and the daily feed
    feed = UNION feed1, feed2;
    -- collect all records that share a trade key
    feed_grouped = GROUP feed BY trade_key;
    -- keep only the latest record of each group
    latest = FOREACH feed_grouped GENERATE
        FLATTEN(your_user_defined_function(feed))
        AS (trade_key, trade_add_date, trade_price);

your_user_defined_function takes the one or more records that share a trade key as input, and it should output only the latest (trade_key, trade_add_date, trade_price) tuple.

By the way, you can also sort the two files by trade key and then merge them with a small script; that's much faster than using Pig.

On Tue, Aug 28, 2012 at 2:36 PM, Srinivas Surasani <[email protected]> wrote:

> Hi,
>
> I'm trying to do updates of records in hadoop using Pig ( I know this is
> not ideal but trying out POC )..
> data looks like the below:
>
> *feed1:*
> --> here trade key is unique for each order/record
> --> this is history file
>
> trade-key trade-add-date trade-price
> *k1 05/21/2012 2000*
> k2 04/21/2012 3000
> k3 03/21/2012 4000
> k4 05/21/2012 5000
>
> *feed2:* --> this is the latest/daily feed
> trade-key trade-add-date trade-price
> k5 06/22/2012 1000
> k6 06/22/2012 2000
> *k1 06/21/2012 3000 ---> we can see here,
> trade with key "k1" is appeared again..that means order with trade key "k1"
> has some update*
>
> Now I'm looking for the below output : ( merging the both files and and
> looking for common key from both feeds and keeping the latest key record in
> the output file )
> *k1 06/21/2012 3000*
> k2 04/21/2012 3000
> k3 06/21/2012 4000
> k4 07/21/2012 5000
> *k5 06/22/2012 1000
> k6 06/22/2012 2000*
>
> any help appreciated greatly !!
>
> Regards,
> Srinivas
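P.S. For the sort-then-merge alternative mentioned in the reply, a minimal Python sketch might look like the one below. The function names parse and merge_feeds are invented for illustration, and the sketch assumes whitespace-separated rows with no header line and MM/DD/YYYY dates as in the sample data. It sorts in memory for brevity; for large files the two inputs would be sorted beforehand (for example with an external sort) and streamed in.

```python
import heapq
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# Assumed from the sample rows, e.g. 06/21/2012.
DATE_FORMAT = "%m/%d/%Y"


def parse(lines):
    """Yield (trade_key, trade_add_date, trade_price) tuples.

    Assumes whitespace-separated fields and no header row; lines that do
    not have exactly three fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            yield tuple(fields)


def merge_feeds(history, daily):
    """Merge two key-sorted record streams, keeping the latest row per key."""
    # heapq.merge walks both sorted inputs in step, so records with the
    # same key come out next to each other.
    merged = heapq.merge(history, daily, key=itemgetter(0))
    for _, group in groupby(merged, key=itemgetter(0)):
        yield max(group, key=lambda rec: datetime.strptime(rec[1], DATE_FORMAT))


if __name__ == "__main__":
    feed1 = ["k1 05/21/2012 2000", "k2 04/21/2012 3000",
             "k3 03/21/2012 4000", "k4 05/21/2012 5000"]
    feed2 = ["k5 06/22/2012 1000", "k6 06/22/2012 2000",
             "k1 06/21/2012 3000"]
    history = sorted(parse(feed1), key=itemgetter(0))
    daily = sorted(parse(feed2), key=itemgetter(0))
    for record in merge_feeds(history, daily):
        print(" ".join(record))
```

Because both inputs are consumed in key order, only the records of one key need to be held at a time, which is what makes this approach cheap compared with a full group-by.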
