Hello all,

I am working on a graph problem using vanilla Spark (not GraphX), and at some
point I would like to do a self-join on an edges RDD[(srcID, dstID, w)] on the
dst key, in order to get all pairs of incoming edges.
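
For concreteness, here is a minimal sketch of what I mean (names like
`SelfJoinSketch` and the sample edge list are just for illustration): key the
edges by dst, join the keyed RDD with itself, and filter to keep each unordered
pair of incoming edges once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SelfJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("self-join-sketch").setMaster("local[*]"))

    // Hypothetical edge list: (srcID, dstID, weight)
    val edges = sc.parallelize(Seq(
      (1L, 3L, 0.5), (2L, 3L, 1.0), (4L, 3L, 0.2), (1L, 2L, 0.9)))

    // Key by dst so the join pairs up incoming edges of the same vertex.
    val byDst = edges.map { case (src, dst, w) => (dst, (src, w)) }

    // Self-join yields every ordered pair (including an edge with itself);
    // filtering on src1 < src2 keeps each unordered pair exactly once.
    val pairs = byDst.join(byDst)
      .filter { case (_, ((s1, _), (s2, _))) => s1 < s2 }

    pairs.collect().foreach(println)
    sc.stop()
  }
}
```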

Since this is the performance bottleneck in my code, I was wondering if there
are any steps to take before performing the self-join in order to make it as
efficient as possible.

In the Learning Spark book
<https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html>
for example, the "Data partitioning" section recommends performing
.partitionBy(new HashPartitioner(100)) on an RDD before joining it with
another.
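
Applied to a self-join, that recommendation would look something like the
sketch below (assuming `byDst` is the edges RDD already keyed by dst; the
partition count 100 is just the book's example value). Since both sides of the
join are the same partitioned RDD, they share a partitioner, so the join
itself should not require a shuffle.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Shuffle once up front, then cache the partitioned RDD so the
// co-location is reused rather than recomputed.
val partitioned = byDst
  .partitionBy(new HashPartitioner(100))
  .persist(StorageLevel.MEMORY_ONLY)

// Both operands share the same partitioner, so matching keys are
// already co-located and the join avoids a second shuffle.
val joined = partitioned.join(partitioned)
```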

Are there any guidelines for optimizing self-join performance?

Regards,
Theodore

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-self-joins-tp20576.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
