Does the following help? JavaPairRDD<bin,key> join with JavaPairRDD<bin,lock>
If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies <http://www.nubetech.co> <http://in.linkedin.com/in/sonalgoyal> On Fri, Oct 31, 2014 at 11:19 PM, <francois.garil...@typesafe.com> wrote: > Hi Steve, > > Are you talking about sequence alignment ? > > — > FG > > > On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis <lordjoe2...@gmail.com> > wrote: > >> >> The original problem is in biology but the following captures the CS >> issues, Assume I have a large number of locks and a large number of keys. >> There is a scoring function between keys and locks and a key that fits a >> lock will have a high score. There may be many keys fitting one lock and a >> key may fit no locks well. The object is to find the best fitting lock for >> each key. >> >> Assume that the number of keys and locks is high enough that taking the >> cartesian product of the two is computationally impractical. Also assume >> that keys and locks have an attached location which is accurate within an >> error (say 1 Km). Only keys and locks within 1 Km need be compared. >> Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks> . I could >> divide the locations into 1 Km squared bins and look only within a few >> bins. Assume that it is practical to take a cartesian product for all >> elements in a bin but not to keep all elements in memory. I could map my >> RDDs into PairRDDs where the key is the bin assigned by location >> >> I know how to take the cartesian product of two JavaRDDs but not how to >> take a cartesian product of sets of elements sharing a common key (bin), >> Any suggestions. Assume that in the worst cases the number of elements in a >> bin are too large to keep in memory although if a bin were subdivided into, >> say 100 subbins elements would fit in memory. >> >> Any thoughts as to how to attack the problem >> > >