Does the following help?

Join a JavaPairRDD<bin,key> with a JavaPairRDD<bin,lock>.

If you partition both RDDs by the bin id, I think you should be able to get
what you want.
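
To make the binning idea concrete, here is a minimal sketch in plain Java, with ordinary collections standing in for the two pair RDDs so that it runs without a cluster. Everything named in it is an assumption for illustration: the Item class, the 1 km BIN_KM constant, the "x:y" bin id, and the placeholder score() (negative squared distance), which you would replace with your real scoring function. It also assumes each key should be compared against its own bin and the 8 neighbouring bins, since a matching lock can sit just across a bin border. In Spark, the HashMap lookup would roughly correspond to emitting each key once per neighbouring bin and then joining on the bin id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BinJoinSketch {
    // Hypothetical located item, used here for both keys and locks.
    static final class Item {
        final String id;
        final double xKm;
        final double yKm;

        Item(String id, double xKm, double yKm) {
            this.id = id;
            this.xKm = xKm;
            this.yKm = yKm;
        }
    }

    // Assumed bin edge length, taken from the 1 km location error.
    static final double BIN_KM = 1.0;

    static long binIndex(double coordKm) {
        return (long) Math.floor(coordKm / BIN_KM);
    }

    static String binId(long bx, long by) {
        return bx + ":" + by;
    }

    // Placeholder score (assumption): closer pairs score higher.
    static double score(Item key, Item lock) {
        double dx = key.xKm - lock.xKm;
        double dy = key.yKm - lock.yKm;
        return -(dx * dx + dy * dy);
    }

    // Returns key id -> id of the best-scoring lock found in nearby bins.
    // Keys with no lock in their 3x3 bin neighbourhood are left out.
    static Map<String, String> bestLocks(List<Item> keys, List<Item> locks) {
        // Group locks by bin; this stands in for the <bin,lock> pair RDD.
        Map<String, List<Item>> locksByBin = new HashMap<>();
        for (Item lock : locks) {
            String bin = binId(binIndex(lock.xKm), binIndex(lock.yKm));
            locksByBin.computeIfAbsent(bin, b -> new ArrayList<>()).add(lock);
        }

        Map<String, String> best = new HashMap<>();
        for (Item key : keys) {
            long bx = binIndex(key.xKm);
            long by = binIndex(key.yKm);
            double bestScore = Double.NEGATIVE_INFINITY;
            // Probe the key's own bin and its 8 neighbours.
            for (long dx = -1; dx <= 1; dx++) {
                for (long dy = -1; dy <= 1; dy++) {
                    List<Item> candidates = locksByBin.getOrDefault(
                            binId(bx + dx, by + dy), Collections.<Item>emptyList());
                    for (Item lock : candidates) {
                        double s = score(key, lock);
                        if (s > bestScore) {
                            bestScore = s;
                            best.put(key.id, lock.id);
                        }
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Item> keys = Arrays.asList(new Item("k1", 0.9, 0.5));
        List<Item> locks = Arrays.asList(
                new Item("l1", 1.1, 0.5),   // neighbouring bin, 0.2 km away
                new Item("l2", 0.1, 0.5));  // same bin as k1, 0.8 km away
        System.out.println(bestLocks(keys, locks));
    }
}
```

Note that only the running best score is kept per key, so the full per-bin product is never materialised; whether that carries over to your worst-case bins in Spark depends on how the join is partitioned.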

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Fri, Oct 31, 2014 at 11:19 PM, <francois.garil...@typesafe.com> wrote:

> Hi Steve,
>
> Are you talking about sequence alignment?
>
> —
> FG
>
>
> On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis <lordjoe2...@gmail.com>
> wrote:
>
>>
>> The original problem is in biology, but the following captures the CS
>> issues. Assume I have a large number of locks and a large number of keys.
>> There is a scoring function between keys and locks, and a key that fits a
>> lock will have a high score. There may be many keys fitting one lock, and a
>> key may fit no locks well. The objective is to find the best-fitting lock
>> for each key.
>>
>> Assume that the number of keys and locks is high enough that taking the
>> cartesian product of the two is computationally impractical. Also assume
>> that keys and locks have an attached location which is accurate to within
>> an error (say 1 km). Only keys and locks within 1 km of each other need be
>> compared. Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks>. I
>> could divide the locations into 1 km square bins and look only within a few
>> bins. Assume that it is practical to take a cartesian product of all
>> elements in a bin but not to keep all elements in memory. I could map my
>> RDDs into PairRDDs where the key is the bin assigned by location.
>>
>> I know how to take the cartesian product of two JavaRDDs but not how to
>> take a cartesian product of sets of elements sharing a common key (bin).
>> Any suggestions? Assume that in the worst case the number of elements in a
>> bin is too large to keep in memory, although if a bin were subdivided into,
>> say, 100 subbins, the elements would fit in memory.
>>
>> Any thoughts as to how to attack the problem?
>>
>
>
