This might be because Spark SQL first shuffles both tables involved in
the join, using the join condition as the key.

I had a specific join use case where I always join on the same column
"id", and I have an optimisation lined up for that: I can cache the
data pre-partitioned on the join key "id" and avoid the shuffle by
passing the partitioning information through to the in-memory cache.

See - https://issues.apache.org/jira/browse/SPARK-4849
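The idea above can be sketched with the plain RDD API (a minimal, hypothetical example, not the SPARK-4849 patch itself): partition both sides with the same partitioner and cache them, so later joins on "id" see co-partitioned inputs and need no shuffle.

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PrepartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("prepartitioned-join").setMaster("local[*]"))

    // Hypothetical datasets keyed by "id".
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((1, "x"), (2, "y")))

    // Partition both sides with the SAME partitioner, then cache.
    val p = new HashPartitioner(4)
    val leftP  = left.partitionBy(p).cache()
    val rightP = right.partitionBy(p).cache()

    // Both inputs are co-partitioned, so this join is a narrow
    // dependency: each output partition reads one cached partition
    // from each side, with no shuffle.
    val joined = leftP.join(rightP)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```

SPARK-4849 asks for the equivalent behaviour for cached SchemaRDDs in Spark SQL, where the partitioning of the cached data is not currently exposed to the join planner.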

Thanks
-Nitin



Sent from the Apache Spark User List mailing list archive at Nabble.com.

