Re: A Spark Design Problem

2014-11-01 Thread Steve Lewis
A join seems to me the proper approach, followed by keying the fits by
KeyID and using combineByKey to choose the best. I am implementing that
now and will report on performance.
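
Roughly, an untested sketch (Key, Lock, Fit, score(), getId(), and
getScore() are placeholders; keysByBin and locksByBin are the two RDDs
already keyed by bin id):

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Join on bin id: emits every (key, lock) pair that shares a bin.
    JavaPairRDD<String, Tuple2<Key, Lock>> candidates =
            keysByBin.join(locksByBin);

    // Score each candidate pairing and re-key the result by the key's id.
    JavaPairRDD<String, Fit> fits = candidates.mapToPair(t -> {
        Key key = t._2()._1();
        Lock lock = t._2()._2();
        return new Tuple2<>(key.getId(), new Fit(key, lock, score(key, lock)));
    });

    // combineByKey keeps only the best-scoring fit for each key.
    JavaPairRDD<String, Fit> best = fits.combineByKey(
            f -> f,                                        // createCombiner
            (b, f) -> f.getScore() > b.getScore() ? f : b, // mergeValue
            (a, b) -> a.getScore() > b.getScore() ? a : b  // mergeCombiners
    );

Since the combined value has the same type as the input value here,
reduceByKey with the same max-by-score function would do equally well.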

On Fri, Oct 31, 2014 at 11:56 AM, Sonal Goyal wrote:

> Does the following help?
>
> JavaPairRDD<BinId, Key> join with JavaPairRDD<BinId, Lock>
>
> If you partition both RDDs by the bin id, I think you should be able to
> get what you want.


Re: A Spark Design Problem

2014-10-31 Thread Sonal Goyal
Does the following help?

JavaPairRDD<BinId, Key> join with JavaPairRDD<BinId, Lock>

If you partition both RDDs by the bin id, I think you should be able to get
what you want.
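
For instance, an untested sketch (Key and Lock are placeholders; the bin
id is a String here): partitioning both sides with the same partitioner
lets the join co-locate matching bins instead of reshuffling either input.

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Hash-partition both RDDs on the bin id with one shared partitioner...
    HashPartitioner byBin = new HashPartitioner(200); // count is illustrative
    JavaPairRDD<String, Key> keysPart = keysByBin.partitionBy(byBin);
    JavaPairRDD<String, Lock> locksPart = locksByBin.partitionBy(byBin);

    // ...and pass the same partitioner to the join to avoid another shuffle.
    JavaPairRDD<String, Tuple2<Key, Lock>> joined =
            keysPart.join(locksPart, byBin);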

Best Regards,
Sonal
Nube Technologies

On Fri, Oct 31, 2014 at 11:19 PM, francois.garillot wrote:

> Hi Steve,
>
> Are you talking about sequence alignment?
>
> —
> FG
>
>
> On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis wrote:
>
>>  The original problem is in biology but the following captures the CS
>> issues. Assume ...


Re: A Spark Design Problem

2014-10-31 Thread francois . garillot
Hi Steve,

Are you talking about sequence alignment?



—
FG

On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis wrote:

>  The original problem is in biology but the following captures the CS
> issues. Assume ...

A Spark Design Problem

2014-10-31 Thread Steve Lewis
The original problem is in biology, but the following captures the CS
issues. Assume I have a large number of locks and a large number of keys.
There is a scoring function between keys and locks, and a key that fits a
lock will have a high score. Many keys may fit one lock, and a key may fit
no lock well. The objective is to find the best-fitting lock for each key.

Assume that the number of keys and locks is high enough that taking the
Cartesian product of the two is computationally impractical. Also assume
that each key and lock has an attached location that is accurate to within
some error (say 1 km); only keys and locks within 1 km of each other need
be compared.
Now assume I can create a JavaRDD<Key> and a JavaRDD<Lock>. I could divide
the locations into 1 km x 1 km bins and look only within a few bins. Assume
that it is practical to take a Cartesian product of all elements in a bin,
but not to keep all elements in memory. I could map my RDDs into PairRDDs
where the key is the bin assigned by location.
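
For example, an untested sketch of that mapping (Key and Lock with
getX()/getY() returning coordinates in km are placeholders, and keys and
locks are the two RDDs above). Locks keep their own 1 km bin, while each
key is copied into the surrounding 3 x 3 block of bins so that pairs
straddling a bin boundary are still compared:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;
    import java.util.ArrayList;
    import java.util.List;

    // Each lock lands in exactly one bin, keyed by its 1 km grid cell.
    JavaPairRDD<String, Lock> locksByBin = locks.mapToPair(l ->
            new Tuple2<>((int) Math.floor(l.getX()) + ":"
                    + (int) Math.floor(l.getY()), l));

    // Each key is replicated into its own bin and the 8 neighbors. A lock
    // lives in one bin, so each (key, lock) pair is generated at most once.
    JavaPairRDD<String, Key> keysByBin = keys.flatMapToPair(k -> {
        List<Tuple2<String, Key>> out = new ArrayList<>();
        int bx = (int) Math.floor(k.getX());
        int by = (int) Math.floor(k.getY());
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                out.add(new Tuple2<>((bx + dx) + ":" + (by + dy), k));
        return out; // Spark 1.x Iterable; in Spark 2+, return out.iterator()
    });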

I know how to take the Cartesian product of two JavaRDDs, but not how to
take a Cartesian product of the sets of elements sharing a common key
(bin). Any suggestions? Assume that in the worst case the number of
elements in a bin is too large to keep in memory, although if a bin were
subdivided into, say, 100 sub-bins, the elements would fit in memory.
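
One direction might be cogroup, which collects exactly those per-bin sets
so the bin-level product can be formed explicitly (an untested sketch,
reusing the types above); but it pulls each bin's contents onto one
executor, which is the worst case just described, whereas appending a
sub-bin index to the bin id would bound the group size:

    // cogroup collects, per bin, all keys and all locks in that bin; the
    // nested loop below forms the per-bin cartesian product.
    JavaPairRDD<String, Tuple2<Key, Lock>> pairs =
            keysByBin.cogroup(locksByBin).flatMapToPair(t -> {
                List<Tuple2<String, Tuple2<Key, Lock>>> out = new ArrayList<>();
                for (Key k : t._2()._1())
                    for (Lock l : t._2()._2())
                        out.add(new Tuple2<>(t._1(), new Tuple2<>(k, l)));
                return out; // Spark 1.x Iterable; Spark 2+: out.iterator()
            });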

Any thoughts as to how to attack the problem?