I'm not sure whether this can be done at the UDF level or whether it would have to be done at a lower level. Imagine you have a good candidate for a replicated join, but beyond that you also know something about the structure of one of the inputs you are joining (for example, that you could build a binary search tree from it and do your comparisons really quickly). Is there a way to write your own join, or to extend the one in Pig? I could imagine a UDF that takes two bags, the left side and the right side, and constructs the join itself, but I don't know that that would be as fast.
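To make the idea concrete, here is a rough sketch of the kind of lookup I have in mind. It is plain Python rather than anything from the Pig API, the names build_index and probe_join are made up for illustration, the sample rows are invented, and a sorted list searched with bisect stands in for the binary search tree:

```python
from bisect import bisect_left, bisect_right

def build_index(small):
    """Sort the small input by key once so it can be binary-searched."""
    rows = sorted(small, key=lambda row: row[0])
    keys = [row[0] for row in rows]
    return keys, rows

def probe_join(large, index):
    """Stream the large input and look up each key in the sorted index."""
    keys, rows = index
    for left in large:
        # The slice [lo:hi] covers every small-side row with a matching key.
        lo = bisect_left(keys, left[0])
        hi = bisect_right(keys, left[0])
        for right in rows[lo:hi]:
            yield left + right

# Hypothetical sample data: the join key is the first field of each tuple.
small = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
large = [(2, "x"), (4, "y"), (1, "z")]
joined = list(probe_join(large, build_index(small)))
```

The small side is indexed once and the large side is only streamed, which is the same shape as a replicated join. Whether doing this inside a UDF over two bags would be competitive is exactly what I'm unsure about.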
Any thoughts?
