Ah, I understand now.  That sounds pretty useful and is something we would
currently plan very inefficiently.


On Sun, Jul 27, 2014 at 1:07 AM, Christos Kozanitis <kozani...@berkeley.edu>
wrote:

> Thanks Michael for the recommendations. Actually the region-join (or I
> could name it range-join or interval-join) that I was thinking of should join
> the entries of two tables with inequality predicates. For example, if table
> A(col1 int, col2 int) contains entries (1,4) and (10,12) and table B(c1
> int, c2 int) contains entries (3,6) and (43,23), then the region-join of A,
> B on (col1 < c2 and c1 < col2) should produce the tuple (1,4,3,6).
>
> Does it make sense?
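A minimal sketch of these semantics, in Python for brevity and using plain lists as a stand-in for the eventual Spark operator (the function name `region_join` is made up for illustration; the overlap form of the predicate, `col1 < c2 and c1 < col2`, is the one that yields exactly the tuple (1,4,3,6) from the entries above):

```python
def region_join(a_rows, b_rows):
    """Naive region join: cross product filtered by the inequality
    (interval-overlap) predicate.  Rows are (start, end) pairs as in
    the example tables A and B."""
    return [
        (col1, col2, c1, c2)
        for (col1, col2) in a_rows
        for (c1, c2) in b_rows
        if col1 < c2 and c1 < col2  # the two intervals overlap
    ]
```

With the example data, `region_join([(1, 4), (10, 12)], [(3, 6), (43, 23)])` returns `[(1, 4, 3, 6)]`.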
>
> Actually there is a JIRA on a similar topic for Hive here:
> https://issues.apache.org/jira/browse/HIVE-556
>
> Also ADAM implements region-joins here:
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/RegionJoin.scala
>
> I was thinking of providing an improved version of the method
> "partitionAndJoin" from the ADAM implementation above.
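A sketch of the partition-and-join idea (not ADAM's actual code; the bin size and names are invented for illustration, again in Python): each interval is replicated into every fixed-size bin it touches, so any overlapping pair is guaranteed to share at least one bin and the expensive pairwise comparison only runs within bins:

```python
from collections import defaultdict

BIN_SIZE = 10  # arbitrary for illustration

def bins(interval):
    """Every fixed-size bin an interval touches (assumes non-negative
    coordinates; an entry like (43,23) is handled by normalizing the ends)."""
    lo, hi = min(interval), max(interval)
    return range(lo // BIN_SIZE, hi // BIN_SIZE + 1)

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def binned_region_join(a_rows, b_rows):
    # Replicate each B interval into every bin it touches.
    by_bin = defaultdict(list)
    for b in b_rows:
        for k in bins(b):
            by_bin[k].append(b)
    seen, out = set(), []
    for a in a_rows:
        for k in bins(a):
            for b in by_bin.get(k, []):
                if overlaps(a, b) and (a, b) not in seen:
                    seen.add((a, b))  # dedupe pairs that share several bins
                    out.append(a + b)
    return out
```

In a distributed setting the bins would become partition keys, so each partition joins only locally co-binned intervals instead of the full cross product.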
>
>
>
> On Sat, Jul 26, 2014 at 12:37 PM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> A very simple example of adding a new operator to Spark SQL:
>> https://github.com/apache/spark/pull/1366
>> An example of adding a new type of join to Spark SQL:
>> https://github.com/apache/spark/pull/837
>>
>> Basically, you will need to add a new physical operator that inherits
>> from SparkPlan and a Strategy that causes the query planner to select it.
>>  Maybe you can explain a little more what you mean by region-join?  If it's
>> only a different algorithm, and not a logically different type of join,
>> then you will not need to make some of the logical modifications that the
>> second PR did.
>>
>> Often the hardest part here is going to be figuring out when to use one
>> join over another.  Right now the rules are pretty straightforward: The
>> joins that are picked first are the most efficient but only handle certain
>> cases (inner joins with equality predicates).  When that is not the case it
>> falls back on slower, but more general operators.  If there are more subtle
>> trade-offs involved then we may need to wait until we have more statistics
>> to help us make the choice.
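The selection rule described here can be caricatured in a few lines (the strategy names and predicate flags are invented; the real planner pattern-matches on Catalyst logical plans): try the specialized, efficient strategies first and fall back to the always-applicable general one:

```python
# Ordered from most efficient / most restrictive to slowest / most general.
STRATEGIES = [
    ("hash-join", lambda is_inner, has_equality: is_inner and has_equality),
    ("cartesian-product", lambda is_inner, has_equality: True),  # fallback
]

def pick_join(is_inner, has_equality):
    """Return the first strategy whose preconditions hold."""
    for name, applies in STRATEGIES:
        if applies(is_inner, has_equality):
            return name
```

A region-join strategy would slot in between these two: more general than the equality hash join, but far cheaper than the cartesian product for inequality-range predicates.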
>>
>> I'd suggest opening a JIRA and proposing a design before going too far.
>>
>> Michael
>>
>>
>> On Sat, Jul 26, 2014 at 3:32 AM, Christos Kozanitis <
>> kozani...@berkeley.edu> wrote:
>>
>>> Hello
>>>
>>> I was wondering, is it easy for you guys to point me to the modules I
>>> need to update if I want to add extra functionality to Spark SQL?
>>>
>>> I was thinking of implementing a region-join operator, and I guess I should
>>> add the implementation details under joins.scala, but what else do I need to
>>> modify?
>>>
>>> thanks
>>> Christos
>>>
>>
>>
>
