In Spark 1.0, outer joins are resolved to BroadcastNestedLoopJoin. You can use Spark 1.1, which resolves outer joins to hash joins.
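For intuition on why the plan matters: a nested-loop join compares every left row against every right row, while a hash join builds a lookup table on one side and probes it. This is a minimal pure-Scala sketch of the two strategies (illustrative only, not Spark code; the Event/Log row types are hypothetical stand-ins for the schemas in the query below):

```scala
object JoinSketch {
  // Hypothetical row types, loosely modeled on the eventid columns in the query plan.
  case class Event(eventId: Long, media: String)
  case class Log(eventId: Long, listId: Long)

  // Nested-loop left outer join: O(n * m) comparisons, like BroadcastNestedLoopJoin.
  def nestedLoopLeftOuter(events: Seq[Event], logs: Seq[Log]): Seq[(Event, Option[Log])] =
    events.flatMap { e =>
      val matches = logs.filter(_.eventId == e.eventId)
      if (matches.isEmpty) Seq((e, None)) else matches.map(l => (e, Some(l)))
    }

  // Hash left outer join: build a hash table on the right side once (O(m)),
  // then probe it once per left row (O(n)).
  def hashLeftOuter(events: Seq[Event], logs: Seq[Log]): Seq[(Event, Option[Log])] = {
    val table = logs.groupBy(_.eventId)
    events.flatMap { e =>
      table.get(e.eventId) match {
        case Some(ls) => ls.map(l => (e, Some(l)))
        case None     => Seq((e, None))
      }
    }
  }
}
```

Both produce the same rows; the hash version just avoids the quadratic scan, which is why the hash-join plans in Spark 1.1 are much faster for large equi-joins.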
Hope this helps!
Liquan

On Mon, Oct 6, 2014 at 4:20 PM, Benyi Wang <bewang.t...@gmail.com> wrote:
> I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in
> Cloudera's maven repository. After putting it into the classpath, I can
> use spark-sql in my application.
>
> One issue is that I couldn't make the join a hash join. It gives a
> CartesianProduct when I join two SchemaRDDs as follows:
>
> scala> val event =
> sqlContext.parquetFile("/events/2014-09-28").select('MediaEventID).join(log,
> joinType=LeftOuter, on=Some("event.eventid".attr === "log.eventid".attr))
> == Query Plan ==
> BroadcastNestedLoopJoin LeftOuter, Some(('event.eventid = 'log.eventid))
>  ParquetTableScan [eventid#130L], (ParquetRelation /events/2014-09-28),
> None
>  ParquetTableScan [eventid#125L,listid#126L,isfavorite#127],
> (ParquetRelation /logs/eventdt=2014-09-28), None
>
> If I join with another SchemaRDD, I would get a CartesianProduct. Is it
> possible to make the join a hash join in Spark 1.0.0?

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst