In Spark 1.0, outer joins are resolved to BroadcastNestedLoopJoin. You can
upgrade to Spark 1.1, which resolves outer joins to hash joins.
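For illustration, here is a rough spark-shell sketch of the same left outer join under Spark 1.1 (paths, aliases, and column names are taken from your example and may need adjusting; the exact imports can vary between versions):

```scala
// Spark 1.1 spark-shell sketch; assumes `sc` is the SparkContext.
// Imports below are my assumption of what the shell needs for
// LeftOuter and the "...".attr DSL in this version.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.plans.LeftOuter
import org.apache.spark.sql.catalyst.dsl.expressions._

val sqlContext = new SQLContext(sc)

// Alias the two SchemaRDDs so the join condition can refer to them.
val event = sqlContext.parquetFile("/events/2014-09-28").as('event)
val log   = sqlContext.parquetFile("/logs/eventdt=2014-09-28").as('log)

val joined = event.join(log, joinType = LeftOuter,
  on = Some("event.eventid".attr === "log.eventid".attr))

// Inspect the physical plan: in 1.1 this should show a hash-based
// outer join rather than BroadcastNestedLoopJoin.
println(joined.queryExecution.executedPlan)
```

This is only a sketch of what the 1.1 planner should do, not output I have verified against your data.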

Hope this helps!
Liquan

On Mon, Oct 6, 2014 at 4:20 PM, Benyi Wang <bewang.t...@gmail.com> wrote:

> I'm using CDH 5.1.0 with Spark-1.0.0. There is a spark-sql-1.0.0 artifact in
> Cloudera's Maven repository. After putting it on the classpath, I can use
> spark-sql in my application.
>
> One issue is that I couldn't get the join to run as a hash join. It gives a
> CartesianProduct when I join two SchemaRDDs as follows:
>
> scala> val event =
> sqlContext.parquetFile("/events/2014-09-28").select('MediaEventID).join(log,
> joinType=LeftOuter, on=Some("event.eventid".attr === "log.eventid".attr))
> == Query Plan ==
> BroadcastNestedLoopJoin LeftOuter, Some(('event.eventid = 'log.eventid))
>  ParquetTableScan [eventid#130L], (ParquetRelation /events/2014-09-28),
> None
>  ParquetTableScan [eventid#125L,listid#126L,isfavorite#127],
> (ParquetRelation /logs/eventdt=2014-09-28), None
>
> If I join with another SchemaRDD, I get a CartesianProduct. Is it
> possible to make the join a hash join in Spark-1.0.0?
>



-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst
