Ah, I understand now. That sounds pretty useful, and it's something we currently plan very inefficiently.
On Sun, Jul 27, 2014 at 1:07 AM, Christos Kozanitis <kozani...@berkeley.edu> wrote:

> Thanks Michael for the recommendations. Actually, the region-join (or I
> could name it range-join or interval-join) that I had in mind should join
> the entries of two tables with inequality predicates. For example, if
> table A(col1 int, col2 int) contains the entries (1,4) and (10,12), and
> table B(c1 int, c2 int) contains the entries (3,6) and (43,23), then the
> region-join of A and B on (col1 < c2 and c1 < col2) should produce the
> tuple (1,4,3,6).
>
> Does that make sense?
>
> There is a JIRA on a similar topic for Hive here:
> https://issues.apache.org/jira/browse/HIVE-556
>
> Also, ADAM implements region-joins here:
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/RegionJoin.scala
>
> I was thinking of providing an improved version of the method
> "partitionAndJoin" from the ADAM implementation above.
>
> On Sat, Jul 26, 2014 at 12:37 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> A very simple example of adding a new operator to Spark SQL:
>> https://github.com/apache/spark/pull/1366
>>
>> An example of adding a new type of join to Spark SQL:
>> https://github.com/apache/spark/pull/837
>>
>> Basically, you will need to add a new physical operator that inherits
>> from SparkPlan and a Strategy that causes the query planner to select
>> it. Maybe you can explain a little more what you mean by region-join?
>> If it's only a different algorithm, and not a logically different type
>> of join, then you will not need to make some of the logical
>> modifications that the second PR did.
>>
>> Often the hardest part here is going to be figuring out when to use one
>> join over another. Right now the rules are pretty straightforward: the
>> joins that are picked first are the most efficient but only handle
>> certain cases (inner joins with equality predicates). When that is not
>> the case, it falls back on slower but more general operators. If there
>> are more subtle trade-offs involved, then we may need to wait until we
>> have more statistics to help us make the choice.
>>
>> I'd suggest opening a JIRA and proposing a design before going too far.
>>
>> Michael
>>
>> On Sat, Jul 26, 2014 at 3:32 AM, Christos Kozanitis <kozani...@berkeley.edu> wrote:
>>
>>> Hello,
>>>
>>> I was wondering, is it easy for you guys to point me to which modules
>>> I need to update if I had to add extra functionality to Spark SQL?
>>>
>>> I was thinking of implementing a region-join operator, and I guess I
>>> should add the implementation details under joins.scala, but what else
>>> do I need to modify?
>>>
>>> thanks
>>> Christos
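For readers following the thread, here is a minimal sketch of the region-join semantics from the example above, in plain Scala over in-memory sequences (no Spark dependencies; the `regionJoin` name and the nested-loop strategy are illustrative only, not the ADAM or Spark SQL implementation):

```scala
// Minimal, self-contained sketch of region-join semantics.
object RegionJoinSketch {
  type Row = (Int, Int) // (start, end) of a region

  // Naive nested-loop region join: emit every (a, b) pair whose rows
  // satisfy the inequality predicate. A real physical operator would
  // avoid the full cross product, e.g. by range-partitioning both
  // inputs so that only co-located regions are compared (roughly the
  // idea behind ADAM's partitionAndJoin).
  def regionJoin(a: Seq[Row], b: Seq[Row])
                (pred: (Row, Row) => Boolean): Seq[(Int, Int, Int, Int)] =
    for (x <- a; y <- b if pred(x, y)) yield (x._1, x._2, y._1, y._2)

  def main(args: Array[String]): Unit = {
    val tableA = Seq((1, 4), (10, 12))
    val tableB = Seq((3, 6), (43, 23))
    // Interval-overlap predicate from the example: col1 < c2 and c1 < col2.
    val result = regionJoin(tableA, tableB) {
      case ((col1, col2), (c1, c2)) => col1 < c2 && c1 < col2
    }
    println(result) // List((1,4,3,6))
  }
}
```

A production version would replace the nested loop with a range-partitioned or sort-merge scheme, which is the part a new SparkPlan operator and its selecting Strategy would have to implement.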