Have you tried joins on regular RDDs instead of SchemaRDDs? We have found
that it is 10 times faster than joins between SchemaRDDs.
val largeRDD = ...
val smallRDD = ...
largeRDD.join(smallRDD) // the other way, the JOIN would run for a long time.
The only limitation I see with that implementation is regular RDD suppor
Hi Cheng,
Thank you very much for taking the time to provide a detailed
explanation.
I tried a few of the things you suggested, and some more.
The ContactDetail table (8 GB) is the fact table and DAgents is the Dim
table (<500 KB), the reverse of what you assumed, but your ideas still
apply.
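To make the fact/dimension shape concrete, here is a minimal plain-Scala
sketch (no Spark involved) of an inner join on a key, where the small side
is hashed once and the large side is streamed past it. It is only an
illustration of the result shape that largeRDD.join(smallRDD) produces, not
how Spark implements the join. The column names (agentId, duration,
agentName) and the sample rows are hypothetical; the real schemas of
ContactDetail and DAgents are not shown in this thread.

```scala
// Plain-Scala sketch (no Spark): inner join of a large keyed collection
// against a small one, hashing the small side once.
// All names and sample rows below are hypothetical.
object KeyedJoinSketch {
  // Inner join on key: every matching (left, right) pair is emitted,
  // mirroring the result shape of a pair-RDD join: (K, (V, W)).
  def joinByKey[K, V, W](large: Seq[(K, V)], small: Seq[(K, W)]): Seq[(K, (V, W))] = {
    // Group the small side by key so that duplicate keys are preserved.
    val lookup: Map[K, Seq[W]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    for {
      (k, v) <- large
      w <- lookup.getOrElse(k, Seq.empty[W])
    } yield (k, (v, w))
  }

  // Stand-in for the fact table: (agentId, duration).
  val contactDetail: Seq[(Int, Int)] = Seq((1, 120), (2, 45), (1, 300), (3, 10))
  // Stand-in for the dimension table: (agentId, agentName).
  val dAgents: Seq[(Int, String)] = Seq((1, "alice"), (2, "bob"))

  // Rows whose key has no match on the small side are dropped.
  val joined: Seq[(Int, (Int, String))] = joinByKey(contactDetail, dAgents)

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

With regular RDDs, the corresponding step would be to key both datasets by
the join column and call join on the large one, as in the snippet quoted in
this thread.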
Hey Venkat,
This behavior seems reasonable. Judging by the table names, I guess
that here DAgents should be the fact table and ContactDetails is the dim
table. Below is an explanation of a similar query; you may treat src as
DAgents and src1 as ContactDetails.
|0: jdbc:hive2://localhos
Bump up.
Michael Armbrust, anybody from Spark SQL team?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-table-Join-one-task-is-taking-long-tp20124p20218.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---