On Jan 4, 2011, at 2:07 PM, Jonathan Coveney wrote:
Thanks for the help Alan, I really appreciate it. Can you currently extend interfaces in Python UDFs? I am not super familiar with how Jython and Python interact in that capacity.
No, we just introduced Python UDFs in 0.8. We haven't yet added the ability for them to implement the Algebraic and Accumulator interfaces.
Alan.
The internal sort in the foreach and the using 'collected' (assuming I can get it to work :) should be big wins.
2011/1/4 Alan Gates <[email protected]>
Answers inline.
On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
I wasn't quite sure what to title this, but hopefully it'll make sense. I have a couple of questions relating to a query that ultimately seeks to do this:
You have
1 10
1 12
1 15
1 16
2 1
2 2
2 3
2 6
You want your output to be the difference between the successive numbers in the second column, i.e.
1 (10,0)
1 (12,2)
1 (15,3)
1 (16,1)
2 (1,0)
2 (2,1)
2 (3,1)
2 (6,3)
Obviously, I need to write a UDF to do this, but I have a couple of questions:
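For concreteness, here is a minimal Python sketch of the per-group computation such a UDF would perform. The function name is made up, and the choice of 0 as the first element's difference is inferred from the example output above:

```python
def successive_diffs(values):
    """Given the second-column values for one group, return
    (value, difference-from-previous) pairs in sorted order.
    The first value's difference is taken to be 0."""
    out = []
    prev = None
    for v in sorted(values):
        out.append((v, 0 if prev is None else v - prev))
        prev = v
    return out

# successive_diffs([10, 12, 15, 16]) -> [(10, 0), (12, 2), (15, 3), (16, 1)]
```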
1) If we know for a fact that the rows for a given first column will ALWAYS be on the same node, do we need to do anything to take advantage of that? My assumption would be that the group operation would be smart enough to take care of this, but I am not sure how it avoids checking to make sure that other nodes don't have additional info (even if I can say for a fact that they don't). Then again, given replication of data, I guess if you do an operation on the grouped data it might still try to distribute that over the filesystem?
First, whether they are located on the same node does not matter. What matters is whether they will all be in the same split when the maps are started. If they are stored in an HDFS file, this usually means that they are all in the same block.
Group by cannot know a priori that all values of the key will be located in the same split. As of Pig 0.7 you can tell Pig this by saying "using 'collected'" after the group by statement. See http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for the exact syntax and restrictions. This tells Pig to do the grouping in the map phase, since it does not need to do a shuffle and reduce to collect all the keys together.
2) The number of values in the second column can potentially be large, and I want this process to be quick, so what's the best way to implement it? Naively I would say to group everything, then pass that bag to a UDF which sorts, does the calculation, and then returns a new bag with the tuples. This doesn't seem like it is taking advantage of a distributed framework... would splitting it up into two UDFs, one which sorts the bag, and then another which returns the tuples (and now that it's sorted, you could distribute it better), be better?
B = group A by firstfield;
C = foreach B {
    C1 = order A by secondfield;
    generate group, yourudf(C1);
}
The order inside the foreach will order each collection by the second field, so there's no need to write a UDF for that. In fact, Pig will take advantage of the secondary sort in MR so that there isn't even a separate sorting pass over the data. yourudf should then implement the Accumulator interface, so that it will receive the collections of records in batches that will be sorted.
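To illustrate the Accumulator pattern Alan describes: Pig's Accumulator is a Java interface whose accumulate() method is called repeatedly with batches of tuples, with getValue() called once at the end and cleanup() between keys. The class below is only an illustrative Python sketch of that shape (not a registrable Pig UDF, since per the thread Python UDFs cannot yet implement it), applied to the successive-difference problem; the class name is made up:

```python
class SuccessiveDiff:
    """Sketch of an Accumulator-style UDF: Pig would feed it the
    already-sorted bag for one key in batches via accumulate(),
    then read the result via get_value()."""

    def __init__(self):
        self.prev = None
        self.result = []

    def accumulate(self, batch):
        # batch: an iterable of second-column values, arriving in sorted
        # order thanks to the secondary sort from the nested 'order by'
        for v in batch:
            self.result.append((v, 0 if self.prev is None else v - self.prev))
            self.prev = v

    def get_value(self):
        return self.result

    def cleanup(self):
        # reset state before the next key
        self.prev = None
        self.result = []
```

Because the differences only ever look one element back, the UDF never needs the whole bag in memory at once, which is exactly what the Accumulator interface is for.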
Alan.
I'm trying to avoid writing my own MR (as I never have before), but am not averse to it if necessary. I am just not sure how to get Pig to do it as efficiently as (I think) it can be done.
I appreciate your help!
Jon