Answers inline.
On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
I wasn't quite sure what to title this, but hopefully it'll make sense. I have a couple of questions relating to a query that ultimately seeks to do this.
You have
1 10
1 12
1 15
1 16
2 1
2 2
2 3
2 6
You want your output to be the difference between the successive numbers in the second column, i.e.
1 (10,0)
1 (12,2)
1 (15,3)
1 (16,1)
2 (1,0)
2 (2,1)
2 (3,1)
2 (6,3)
Obviously, I need to write a UDF to do this, but I have a couple of questions.

1) If we know for a fact that the rows for a given first column will ALWAYS be on the same node, do we need to do anything to take advantage of that? My assumption would be that the group operation would be smart enough to take care of this, but I am not sure how it avoids checking to make sure that other nodes don't have additional info (even if I can say for a fact that they don't). Then again, given replication of data, I guess if you do an operation on the grouped data it might still try to distribute that over the filesystem?
First, whether they are located on the same node does not matter. What matters is whether they will all be in the same split when the maps are started. If they are stored in an HDFS file, this usually means that they are all in the same block.

Group by cannot know a priori that all values of the key will be located in the same split. As of Pig 0.7 you can tell Pig this by saying "using 'collected'" after the group by statement. See http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for the exact syntax and restrictions. This tells Pig to do the grouping in the map phase, since it does not need to do a shuffle and reduce to collect all the keys together.
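For example, assuming the data is loaded with a loader that can make this guarantee (as I recall, it has to implement CollectableLoadFunc and the group has to directly follow the load), the script would look roughly like this sketch, where SomeCollectableLoader is just a placeholder name:

A = load 'input' using SomeCollectableLoader() as (firstfield:int, secondfield:int);
B = group A by firstfield using 'collected';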
2) The number of values in the second column can potentially be large, and I want this process to be quick, so what's the best way to implement it? Naively, I would say to group everything, then pass that bag to a UDF which sorts, does the calculation, and then returns a new bag with the tuples. This doesn't seem like it is taking advantage of a distributed framework... would splitting it up into two UDFs, one which sorts the bag, and then another which returns the tuples (and now that it's sorted, you could distribute it better), be better?
B = group A by firstfield;
C = foreach B {
    C1 = order A by secondfield;
    generate group, yourudf(C1);
}
The order inside the foreach will order each collection by the second field, so there's no need to write a UDF for that. In fact, Pig will take advantage of the secondary sort in MR, so there isn't even a separate sorting pass over the data. yourudf should then implement the Accumulator interface so that it receives the records in batches that are already sorted.
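As a rough, untested sketch of such a UDF (the class name SuccessiveDiff, the field positions, and the numeric types are assumptions about your schema, not something Pig dictates), it might look like:

import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Emits a bag of (value, difference from the previous value) tuples,
// relying on the input bag arriving in sorted order.
public class SuccessiveDiff extends EvalFunc<DataBag> implements Accumulator<DataBag> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    private DataBag output = bagFactory.newDefaultBag();
    private Long previous = null;   // last value seen in the sorted stream

    // Called repeatedly with successive batches of the (already sorted) bag.
    @Override
    public void accumulate(Tuple input) throws IOException {
        DataBag batch = (DataBag) input.get(0);
        for (Tuple t : batch) {
            // Assumes the second column of A is the numeric field at position 1.
            long current = ((Number) t.get(1)).longValue();
            long diff = (previous == null) ? 0 : current - previous;
            Tuple out = tupleFactory.newTuple(2);
            out.set(0, current);
            out.set(1, diff);
            output.add(out);
            previous = current;
        }
    }

    @Override
    public DataBag getValue() {
        return output;
    }

    @Override
    public void cleanup() {
        output = bagFactory.newDefaultBag();
        previous = null;
    }

    // Used when Pig decides not to run the UDF in accumulative mode.
    @Override
    public DataBag exec(Tuple input) throws IOException {
        cleanup();
        accumulate(input);
        DataBag result = getValue();
        cleanup();
        return result;
    }
}

You would then register the jar and use SuccessiveDiff(C1) in place of yourudf(C1) above.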
Alan.
I'm trying to avoid writing my own MR (as I never have before), but am not averse to it if necessary. I am just not sure how to get Pig to do it as efficiently as (I think) it can be done.

I appreciate your help!
Jon