I wasn't quite sure what to title this, but hopefully it'll make sense. I have
a couple of questions relating to a query that ultimately seeks to do this:

You have

1 10
1 12
1 15
1 16
2 1
2 2
2 3
2 6

You want your output to pair each number in the second column with its
difference from the previous number in the same group (0 for the first), i.e.

1 (10,0)
1 (12,2)
1 (15,3)
1 (16,1)
2 (1,0)
2 (2,1)
2 (3,1)
2 (6,3)
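
To pin down exactly what I mean, here is a minimal sketch of the
transformation in plain Python (not Pig, and not a real UDF). The function
name successive_diffs is just something I made up for illustration, and it
assumes the whole input fits in memory, which the real data may not.

```python
from itertools import groupby
from operator import itemgetter


def successive_diffs(rows):
    """Take (key, value) pairs and return (key, (value, diff)) tuples.

    diff is the gap to the previous value within the same key,
    and 0 for the first value of each key.
    """
    out = []
    # Sort by key, then by value, so each key's values are ascending.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for key, group in groupby(ordered, key=itemgetter(0)):
        prev = None
        for _, value in group:
            diff = 0 if prev is None else value - prev
            out.append((key, (value, diff)))
            prev = value
    return out


rows = [(1, 10), (1, 12), (1, 15), (1, 16), (2, 1), (2, 2), (2, 3), (2, 6)]
for key, pair in successive_diffs(rows):
    print(key, pair)
```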

Obviously, I need to write a UDF to do this, but I have a couple of questions:

1) if we know for a fact that the rows for a given first column will ALWAYS
be on the same node, do we need to do anything to take advantage of that? My
assumption would be that the group operation would be smart enough to take
care of this, but I am not sure how it avoids checking to make sure that
other nodes don't have additional info (even if I can say for a fact that
they don't). Then again, given replication of data I guess if you do an
operation on the grouped data it might still try and distribute that over
the filesystem?

2) The number of values in the second column can potentially be large, and I
want this process to be quick, so what's the best way to implement it?
Naively I would say to group everything, then pass that bag to a UDF which
sorts, does the calculation, and then returns a new bag with the tuples.
This doesn't seem like it is taking advantage of a distributed framework.
Would it be better to split it up into two UDFs, one which sorts the bag and
another which returns the tuples (now that the bag is sorted, the second
step could perhaps be distributed better)?
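
Roughly, the two-step split I have in mind looks like the sketch below, again
in plain Python rather than as actual Pig UDFs. The names sort_bag and
diffs_from_sorted are hypothetical, and a "bag" here is just a list of the
second-column values for one group. The second step only needs to remember
the previous value, so it could stream over the sorted input instead of
holding it all at once.

```python
def sort_bag(values):
    """Step 1: return the group's values in ascending order."""
    return sorted(values)


def diffs_from_sorted(sorted_values):
    """Step 2: yield (value, diff) pairs from already-sorted values.

    Only the previous value is kept, so this is a single streaming pass.
    """
    prev = None
    for value in sorted_values:
        yield (value, 0 if prev is None else value - prev)
        prev = value


bag = [16, 10, 15, 12]  # values for one group, in arbitrary order
result = list(diffs_from_sorted(sort_bag(bag)))
print(result)
```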

I'm trying to avoid writing my own MapReduce job (as I never have before),
but am not averse to it if necessary. I am just not sure how to get Pig to
do it as efficiently as (I think) it can be done.

I appreciate your help!
Jon
