I found really interesting all those papers. I haven't finished reading the band join algorithm paper either, but there are a couple of things that intrigue me e.g. the Almaden paper compares its results against Pig version 0.2 I mean I think the study made by them is great, but Pig is in a stable 0.8 now, wouldn't Pig perform better now than then? Has Pig embraced any of the paper suggestions? Anyways, creating a custom join inside a UDF might not be suitable for some specialized types of join, but maybe for others such as Parallel Set-*Similarity Joins *would be easier (flamingo.ics.uci.edu/pub/sigmod10-vernica.pdf), don't you think it might be possible? I mean we could take advantage of not doing only raw MapReduce
Renato M. 2011/1/28 Alan Gates <[email protected]> > Depending on the join algorithm you may be able to implement it with > cogroup, a custom UDF, and possibly a custom partitioner. I haven't > finished reading the band join algorithm paper I sent a link for, but I > suspect it requires some records to be duplicated (since records within the > band will need to be sent to multiple reducers to match records from the > other side). That you cannot do without implementing a custom join. > > For an example of how to implement a custom join take a look at > https://issues.apache.org/jira/browse/PIG-792 This has a lot of sampling > code you won't have to worry about. But it will give you an idea of the > logical and physical operators inside Pig that would be needed. > > Also, here's some input from Chris Olston, one of our research scientists > at Yahoo with expertise in databases: > > >>> > I have not read the paper you sent but it seems to be about so-called “band > joins”, which are a special case of non-equijoin that arise frequently in > practice, and offer obvious opportunities for locality-based strategies e.g. > using indexes and (distributed) partitioning. One approach that would be > consistent with the Pig “low-level” philosophy would be to expose “BAND > JOIN” as an operator and have a corresponding implementation along the lines > of what that paper proposes. > > Also, as you know Utkarsh’s original implementation of CROSS (still the > same?) performs a “generalized fragment-and-replicate” strategy, which is a > way to do arbitrary non-equi-joins in a way that spreads work onto lots of > machines (CROSS can be seen as non-equi-join with a very promiscuous join > predicate :). There are probably papers that try to optimize the NxM grid > structure of the generalized f-and-r topology, based on the relative sizes > of the inputs, the join selectivity, data distributions, etc. I think the > paper that originally surfaced this idea is: > http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=250116. Not sure > whether there were follow-on papers that try to do more optimization. > Fast-forwarding to modern times, I believe the Almaden SIGMOD’10 paper might > have investigated f-and-r join strategies for the map-reduce context: > http://portal.acm.org/citation.cfm?doid=1807167.1807273. There’s also the > Ullman paper that proposes (but does not evaluate empirically) some > map-reduce join strategies: > http://ilpubs.stanford.edu:8090/957/1/mapred-join-report.pdf > <<< > > Alan. > > > On Jan 28, 2011, at 7:35 AM, Jonathan Coveney wrote: > > I'm not sure if this can be done at the UDF level, or if it'd have to be >> done lower level. Imagine you have a good candidate for a replicated join, >> but beyond that you know most about the structure of one of the pieces of >> information you are joining (for example, that you could build a binary >> search tree from it and do your comparisons really quickly, or something). >> Is there a way to make your own join, or extend the one in pig? I could >> imagine a UDF that takes two bags, the left piece and the right piece, >> constructs your join, etc, but I don't know that that would be as fast. >> >> Any thoughts? >> > >
