Depending on the join algorithm you may be able to implement it with
cogroup, a custom UDF, and possibly a custom partitioner. I haven't
finished reading the band join algorithm paper I sent a link for, but
I suspect it requires some records to be duplicated (since records
within the band will need to be sent to multiple reducers to match
records from the other side). That you cannot do without implementing
a custom join.
For an example of how to implement a custom join take a look at https://issues.apache.org/jira/browse/PIG-792
This has a lot of sampling code you won't have to worry about. But
it will give you an idea of the logical and physical operators inside
Pig that would be needed.
Also, here's some input from Chris Olston, one of our research
scientists at Yahoo with expertise in databases:
>>>
I have not read the paper you sent but it seems to be about so-called
“band joins”, which are a special case of non-equijoin that arise
frequently in practice, and offer obvious opportunities for locality-
based strategies e.g. using indexes and (distributed) partitioning.
One approach that would be consistent with the Pig “low-level”
philosophy would be to expose “BAND JOIN” as an operator and have a
corresponding implementation along the lines of what that paper
proposes.
Also, as you know Utkarsh’s original implementation of CROSS (still
the same?) performs a “generalized fragment-and-replicate” strategy,
which is a way to do arbitrary non-equi-joins in a way that spreads
work onto lots of machines (CROSS can be seen as non-equi-join with a
very promiscuous join predicate :). There are probably papers that try
to optimize the NxM grid structure of the generalized f-and-r
topology, based on the relative sizes of the inputs, the join
selectivity, data distributions, etc. I think the paper that
originally surfaced this idea is: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=250116
. Not sure whether there were follow-on papers that try to do more
optimization. Fast-forwarding to modern times, I believe the Almaden
SIGMOD’10 paper might have investigated f-and-r join strategies for
the map-reduce context:http://portal.acm.org/citation.cfm?doid=1807167.1807273
. There’s also the Ullman paper that proposes (but does not evaluate
empirically) some map-reduce join strategies:http://ilpubs.stanford.edu:8090/957/1/mapred-join-report.pdf
<<<
Alan.
On Jan 28, 2011, at 7:35 AM, Jonathan Coveney wrote:
I'm not sure if this can be done at the UDF level, or if it'd have
to be
done lower level. Imagine you have a good candidate for a replicated
join,
but beyond that you know most about the structure of one of the
pieces of
information you are joining (for example, that you could build a
binary
search tree from it and do your comparisons really quickly, or
something).
Is there a way to make your own join, or extend the one in pig? I
could
imagine a UDF that takes two bags, the left piece and the right piece,
constructs your join, etc, but I don't know that that would be as
fast.
Any thoughts?