In Spark 1.0, outer joins are resolved to BroadcastNestedLoopJoin. You can use Spark 1.1, which resolves outer joins to hash joins.
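For intuition on why the plan matters: a nested-loop join compares every left row against every right row, while a hash join builds a lookup table on one side and probes it. This is a minimal pure-Scala sketch of the two strategies (illustrative only, not Spark code; the Event/Log row types are hypothetical stand-ins for the schemas in the query below):

```scala
object JoinSketch {
  // Hypothetical row types, loosely modeled on the eventid columns in the query plan.
  case class Event(eventId: Long, media: String)
  case class Log(eventId: Long, listId: Long)

  // Nested-loop left outer join: O(n * m) comparisons, like BroadcastNestedLoopJoin.
  def nestedLoopLeftOuter(events: Seq[Event], logs: Seq[Log]): Seq[(Event, Option[Log])] =
    events.flatMap { e =>
      val matches = logs.filter(_.eventId == e.eventId)
      if (matches.isEmpty) Seq((e, None)) else matches.map(l => (e, Some(l)))
    }

  // Hash left outer join: build a hash table on the right side once (O(m)),
  // then probe it once per left row (O(n)).
  def hashLeftOuter(events: Seq[Event], logs: Seq[Log]): Seq[(Event, Option[Log])] = {
    val table = logs.groupBy(_.eventId)
    events.flatMap { e =>
      table.get(e.eventId) match {
        case Some(ls) => ls.map(l => (e, Some(l)))
        case None     => Seq((e, None))
      }
    }
  }
}
```

Both produce the same rows; the hash version just avoids the quadratic scan, which is why the hash-join plans in Spark 1.1 are much faster for large equi-joins.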
Hope this helps!
Liquan

On Mon, Oct 6, 2014 at 4:20 PM, Benyi Wang <bewang.t...@gmail.com> wrote:
> I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in
> Cloudera's maven repository. After putting it into the classpath, I can
> use spark-sql in my application.
>
> One issue is that I couldn't make the join a hash join. It gives a
> CartesianProduct when I join two SchemaRDDs as follows:
>
> scala> val event =
> sqlContext.parquetFile("/events/2014-09-28").select('MediaEventID).join(log,
> joinType=LeftOuter, on=Some("event.eventid".attr === "log.eventid".attr))
> == Query Plan ==
> BroadcastNestedLoopJoin LeftOuter, Some(('event.eventid = 'log.eventid))
>  ParquetTableScan [eventid#130L], (ParquetRelation /events/2014-09-28),
> None
>  ParquetTableScan [eventid#125L,listid#126L,isfavorite#127],
> (ParquetRelation /logs/eventdt=2014-09-28), None
>
> If I join with another SchemaRDD, I would get a CartesianProduct. Is it
> possible to make the join a hash join in Spark 1.0.0?

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst