Re: How would you implement a custom join?

Alan Gates Fri, 28 Jan 2011 08:47:55 -0800

Depending on the join algorithm, you may be able to implement it with cogroup, a custom UDF, and possibly a custom partitioner. I haven't finished reading the band join algorithm paper I sent a link for, but I suspect it requires some records to be duplicated (since records within the band will need to be sent to multiple reducers to match records from the other side). That you cannot do without implementing a custom join.
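
To make the duplication point concrete, here is a minimal sketch in plain Python (not Pig, and not the algorithm from the paper). It assumes numeric join keys and a band predicate of the form |left_key - right_key| <= width. The function name band_join and the bucket scheme are hypothetical, and the buckets merely stand in for reducers.

```python
import math
from collections import defaultdict

def band_join(left, right, width):
    """Return (left_val, right_val) pairs with |left_key - right_key| <= width.

    Inputs are lists of (key, value) tuples. Buckets of size `width`
    play the role of reducers; `width` must be positive.
    """
    buckets = defaultdict(lambda: ([], []))  # bucket id -> (lefts, rights)

    # Each right-side record is sent to exactly one bucket.
    for key, val in right:
        buckets[math.floor(key / width)][1].append((key, val))

    # Each left-side record is duplicated into its own bucket and both
    # neighbours, because a match within the band can sit one bucket away.
    for key, val in left:
        b = math.floor(key / width)
        for nb in (b - 1, b, b + 1):
            buckets[nb][0].append((key, val))

    # "Reduce" step: compare only the records that share a bucket.
    out = []
    for lefts, rights in buckets.values():
        for lk, lv in lefts:
            for rk, rv in rights:
                if abs(lk - rk) <= width:
                    out.append((lv, rv))
    return out
```

Because every right-side record lives in exactly one bucket, each matching pair is produced once, at the cost of shipping every left-side record three times.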

For an example of how to implement a custom join, take a look at https://issues.apache.org/jira/browse/PIG-792. This has a lot of sampling code you won't have to worry about. But it will give you an idea of the logical and physical operators inside Pig that would be needed.

Also, here's some input from Chris Olston, one of our research scientists at Yahoo with expertise in databases:

>>>

I have not read the paper you sent but it seems to be about so-called “band joins”, which are a special case of non-equijoin that arise frequently in practice, and offer obvious opportunities for locality-based strategies, e.g. using indexes and (distributed) partitioning. One approach that would be consistent with the Pig “low-level” philosophy would be to expose “BAND JOIN” as an operator and have a corresponding implementation along the lines of what that paper proposes.

Also, as you know Utkarsh’s original implementation of CROSS (still the same?) performs a “generalized fragment-and-replicate” strategy, which is a way to do arbitrary non-equi-joins in a way that spreads work onto lots of machines (CROSS can be seen as non-equi-join with a very promiscuous join predicate :). There are probably papers that try to optimize the NxM grid structure of the generalized f-and-r topology, based on the relative sizes of the inputs, the join selectivity, data distributions, etc. I think the paper that originally surfaced this idea is: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=250116. Not sure whether there were follow-on papers that try to do more optimization. Fast-forwarding to modern times, I believe the Almaden SIGMOD’10 paper might have investigated f-and-r join strategies for the map-reduce context: http://portal.acm.org/citation.cfm?doid=1807167.1807273. There’s also the Ullman paper that proposes (but does not evaluate empirically) some map-reduce join strategies: http://ilpubs.stanford.edu:8090/957/1/mapred-join-report.pdf

<<<
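
The NxM grid idea that Chris describes can be illustrated with a small sketch in plain Python. This is not Pig's CROSS implementation; the function name fragment_replicate_join, the round-robin fragmenting, and the parameters n and m are all assumptions made for illustration. Each grid cell stands in for one worker.

```python
from itertools import product

def fragment_replicate_join(left, right, predicate, n, m):
    """Evaluate an arbitrary join predicate over an n x m grid of cells.

    `left` is split into n fragments and `right` into m fragments.
    Cell (i, j) receives left fragment i and right fragment j, so each
    left record is replicated to m cells and each right record to n cells.
    """
    left_frags = [left[i::n] for i in range(n)]    # round-robin split
    right_frags = [right[j::m] for j in range(m)]

    out = []
    for i, j in product(range(n), range(m)):
        # Work done inside one cell; cells are independent of each other
        # and could run on separate machines.
        for l in left_frags[i]:
            for r in right_frags[j]:
                if predicate(l, r):
                    out.append((l, r))
    return out
```

With a predicate that always returns True this degenerates to a cross product, which matches the remark that CROSS is a non-equi-join with a very permissive predicate. Choosing n and m relative to the input sizes is the tuning question the papers above address.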

Alan.

On Jan 28, 2011, at 7:35 AM, Jonathan Coveney wrote:

I'm not sure if this can be done at the UDF level, or if it'd have to be done lower level. Imagine you have a good candidate for a replicated join, but beyond that you know most about the structure of one of the pieces of information you are joining (for example, that you could build a binary search tree from it and do your comparisons really quickly, or something). Is there a way to make your own join, or extend the one in Pig? I could imagine a UDF that takes two bags, the left piece and the right piece, constructs your join, etc, but I don't know that that would be as fast.
Any thoughts?
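
The kind of join asked about here might look roughly like the following plain-Python sketch. It assumes the small side fits in memory, uses a sorted list with the standard bisect module in place of a binary search tree, and assumes a range predicate on numeric keys. The name replicated_range_join and the width parameter are hypothetical, and nothing here is a Pig API.

```python
from bisect import bisect_left, bisect_right

def replicated_range_join(big, small, width):
    """Yield (big_val, small_val) for pairs with |big_key - small_key| <= width.

    `small` is the replicated side: it is sorted once and then probed
    with binary search for every record streamed from `big`.
    """
    small_sorted = sorted(small, key=lambda rec: rec[0])
    keys = [k for k, _ in small_sorted]

    for key, val in big:
        # Locate the slice of small-side keys in [key - width, key + width].
        lo = bisect_left(keys, key - width)
        hi = bisect_right(keys, key + width)
        for _, small_val in small_sorted[lo:hi]:
            yield (val, small_val)
```

Each probe costs O(log n) plus the number of matches, rather than a scan of the whole small side, which is the speedup the question is after.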
