I wasn't quite sure what to title this, but hopefully it'll make sense. I have a couple of questions relating to a query that ultimately seeks to do this:
You have

    1 10
    1 12
    1 15
    1 16
    2 1
    2 2
    2 3
    2 6

You want your output to be the difference between the successive numbers in the second column within each group (0 for the first row of a group), i.e.

    1 (10,0)
    1 (12,2)
    1 (15,3)
    1 (16,1)
    2 (1,0)
    2 (2,1)
    2 (3,1)
    2 (6,3)

Obviously, I need to write a UDF to do this, but I have a couple of questions.

1) If we know for a fact that the rows for a given first-column value will ALWAYS be on the same node, do we need to do anything to take advantage of that? My assumption would be that the group operation is smart enough to take care of this, but I am not sure how it avoids checking that other nodes don't have additional rows (even if I can say for a fact that they don't). Then again, given replication of data, I guess an operation on the grouped data might still be distributed across the filesystem?

2) The number of values in the second column can potentially be large, and I want this process to be quick, so what's the best way to implement it? Naively, I would group everything, then pass that bag to a UDF which sorts, does the calculation, and returns a new bag with the tuples. This doesn't seem to take advantage of a distributed framework. Would it be better to split it into two UDFs, one which sorts the bag and another which returns the tuples (which, now that the bag is sorted, could perhaps be distributed better)?

I'm trying to avoid writing my own MR job (as I never have before), but am not averse to it if necessary. I am just not sure how to get Pig to do this as efficiently as (I think) it can be done.

I appreciate your help!
Jon
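
P.S. In case the description is unclear, here is a minimal plain-Python sketch of the computation I have in mind. It is not a Pig UDF and says nothing about how Pig would distribute the work; the function name successive_differences is made up for illustration, and I am assuming the first value in each group gets a difference of 0.

```python
from itertools import groupby
from operator import itemgetter


def successive_differences(rows):
    """Return (key, (value, diff)) tuples for an iterable of (key, value) rows.

    Hypothetical helper: diff is the gap to the previous value in the same
    group, and 0 for the first value of each group.
    """
    result = []
    ordered = sorted(rows)  # sorts by key first, then by value
    for key, group in groupby(ordered, key=itemgetter(0)):
        previous = None
        for _, value in group:
            diff = 0 if previous is None else value - previous
            result.append((key, (value, diff)))
            previous = value
    return result


rows = [(1, 10), (1, 12), (1, 15), (1, 16), (2, 1), (2, 2), (2, 3), (2, 6)]
for key, pair in successive_differences(rows):
    print(key, pair)
```

The sort plus the single pass per group is the "naive" approach from question 2; what I am unsure about is how best to express the same thing in Pig.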
