Hi Rob,

Thanks for sharing. The approach you take is similar to how Pig implements the cross product (see the cross section in: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html).
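
For anyone following along, here is a rough sketch of that grid idea in plain Python, as I understand it from the book's description. It is not Pig's actual code: the names grid_cross and GRID, the grid size of 4, and the in-memory dict standing in for the shuffle are all assumptions made for illustration. Each left record picks a random row and is copied to every column, each right record picks a random column and is copied to every row, so any given pair meets in exactly one cell.

```python
import random
from collections import defaultdict

GRID = 4  # hypothetical number of cells per side


def grid_cross(left, right, grid=GRID, seed=0):
    """Yield every (l, r) pair exactly once using a grid of cells.

    No cell ever holds the whole of both inputs; each cell only sees the
    records that were routed to it.
    """
    rng = random.Random(seed)
    # cell coordinate -> (left records, right records); stands in for the shuffle
    cells = defaultdict(lambda: ([], []))

    # "Map" side: a left record picks one row and is replicated across all columns.
    for rec in left:
        row = rng.randrange(grid)
        for col in range(grid):
            cells[(row, col)][0].append(rec)

    # A right record picks one column and is replicated across all rows.
    for rec in right:
        col = rng.randrange(grid)
        for row in range(grid):
            cells[(row, col)][1].append(rec)

    # "Reduce" side: each cell crosses only the records it received.
    for lefts, rights in cells.values():
        for l in lefts:
            for r in rights:
                yield (l, r)


if __name__ == "__main__":
    for pair in grid_cross([1, 2, 3], ["a", "b"]):
        print(pair)
```

The price is replication: every record is copied GRID times, in exchange for GRID * GRID cells that can be processed independently.
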
What you'll probably find interesting is this article: Processing Theta-Joins using MapReduce (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.229.1890&rep=rep1&type=pdf), which features a similar grid-like approach, but with some smart tricks. You'll probably also like Jimmy Lin's articles on pairwise similarity in MapReduce (http://www.umiacs.umd.edu/~jimmylin/publications/index.html).

best,
Vasco

On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <[email protected]> wrote:
> Happy New Year :)
>
> Thought some of you might find this useful.
>
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
>
> http://dynamicorange.com/2012/12/31/pairwise-comparisons-of-large-datasets/
>
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
>
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
>
> thanks
>
> rob
