Answers inline.
On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
I wasn't quite sure what to title this, but hopefully it'll make sense. I have a couple of questions relating to a query that ultimately seeks to do this.
You have
1 10
1 12
1 15
1 16
2 1
2 2
2 3
2 6
You want your output to be the difference between the successive numbers in the second column, i.e.
1 (10,0)
1 (12,2)
1 (15,3)
1 (16,1)
2 (1,0)
2 (2,1)
2 (3,1)
2 (6,3)
Obviously, I need to write a UDF to do this, but I have a couple of questions.

1) If we know for a fact that the rows for a given first column will ALWAYS be on the same node, do we need to do anything to take advantage of that? My assumption would be that the group operation would be smart enough to take care of this, but I am not sure how it avoids checking to make sure that other nodes don't have additional info (even if I can say for a fact that they don't). Then again, given replication of data, I guess if you do an operation on the grouped data it might still try to distribute that over the filesystem?
First, whether they are located on the same node does not matter. What matters is whether they will all be in the same split when the maps are started. If they are stored in an HDFS file, this usually means that they are all in the same block.

Group by cannot know a priori that all values of the key will be located in the same split. As of Pig 0.7 you can tell Pig this by saying "using 'collected'" after the group by statement. See http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for the exact syntax and restrictions. This tells Pig to do the grouping in the map phase, since it does not need to do a shuffle and reduce to collect all the keys together.
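For example, assuming the data is loaded with a loader that can make this guarantee (as I recall, it has to implement CollectableLoadFunc and the group has to directly follow the load), the script would look roughly like this sketch, where SomeCollectableLoader is just a placeholder name:

A = load 'input' using SomeCollectableLoader() as (firstfield:int, secondfield:int);
B = group A by firstfield using 'collected';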
2) The number of values in the second column can potentially be large, and I want this process to be quick, so what's the best way to implement it? Naively, I would say to group everything, then pass that bag to a UDF which sorts, does the calculation, and then returns a new bag with the tuples. This doesn't seem like it is taking advantage of a distributed framework... would splitting it up into two UDFs, one which sorts the bag, and then another which returns the tuples (and now that it's sorted, you could distribute it better), be better?
B = group A by firstfield;
C = foreach B {
    C1 = order A by secondfield;
    generate group, yourudf(C1);
}
The order inside the foreach will order each collection by the second field, so there's no need to write a UDF for that. In fact, Pig will take advantage of the secondary sort in MR, so there isn't even a separate sorting pass over the data. yourudf should then implement the Accumulator interface so that it receives the records in batches that are already sorted.
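As a rough, untested sketch of such a UDF (the class name SuccessiveDiff, the field positions, and the numeric types are assumptions about your schema, not something Pig dictates), it might look like:

import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Emits a bag of (value, difference from the previous value) tuples,
// relying on the input bag arriving in sorted order.
public class SuccessiveDiff extends EvalFunc<DataBag> implements Accumulator<DataBag> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    private DataBag output = bagFactory.newDefaultBag();
    private Long previous = null;   // last value seen in the sorted stream

    // Called repeatedly with successive batches of the (already sorted) bag.
    @Override
    public void accumulate(Tuple input) throws IOException {
        DataBag batch = (DataBag) input.get(0);
        for (Tuple t : batch) {
            // Assumes the second column of A is the numeric field at position 1.
            long current = ((Number) t.get(1)).longValue();
            long diff = (previous == null) ? 0 : current - previous;
            Tuple out = tupleFactory.newTuple(2);
            out.set(0, current);
            out.set(1, diff);
            output.add(out);
            previous = current;
        }
    }

    @Override
    public DataBag getValue() {
        return output;
    }

    @Override
    public void cleanup() {
        output = bagFactory.newDefaultBag();
        previous = null;
    }

    // Used when Pig decides not to run the UDF in accumulative mode.
    @Override
    public DataBag exec(Tuple input) throws IOException {
        cleanup();
        accumulate(input);
        DataBag result = getValue();
        cleanup();
        return result;
    }
}

You would then register the jar and use SuccessiveDiff(C1) in place of yourudf(C1) above.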
Alan.
I'm trying to avoid writing my own MR (as I never have before), but am not averse to it if necessary. I am just not sure how to get Pig to do it as efficiently as (I think) it can be done.

I appreciate your help!
Jon