Hello all, I am working on a graph problem using vanilla Spark (not GraphX), and at some point I would like to do a self-join on an edges RDD[(srcID, dstID, w)] on the dst key, in order to get all pairs of incoming edges.
Since this is the performance bottleneck in my code, I was wondering if there are any steps I should take before performing the self-join to make it as efficient as possible. For example, in the "Data partitioning" section of the Learning Spark book <https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html>, they recommend calling .partitionBy(new HashPartitioner(100)) on an RDD before joining it with another. Are there any guidelines for optimizing self-join performance?

Regards,
Theodore

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-self-joins-tp20576.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
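For reference, a minimal sketch of the partition-then-self-join approach described above, with hypothetical sample data (the vertex IDs, weights, and partition count are made up for illustration). The key point is that after .partitionBy both sides of the self-join share the same partitioner, so the join itself requires no additional shuffle:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for illustration; in a real job this already exists.
val sc = new SparkContext(
  new SparkConf().setAppName("self-join-sketch").setMaster("local[2]"))

// Hypothetical edge list: (srcID, dstID, w)
val edges = sc.parallelize(Seq(
  (1L, 3L, 0.5), (2L, 3L, 1.0), (4L, 3L, 0.2), (1L, 2L, 0.7)))

// Key by dst so the join groups incoming edges of the same vertex.
// Partition once and cache: both sides of the self-join then use the
// same HashPartitioner, so the join incurs no further shuffle.
val byDst = edges.map { case (src, dst, w) => (dst, (src, w)) }
  .partitionBy(new HashPartitioner(4))
  .cache()

// The self-join yields every ordered pair of incoming edges per dst;
// keeping src1 < src2 drops self-pairs and duplicate orderings.
val pairs = byDst.join(byDst)
  .filter { case (_, ((src1, _), (src2, _))) => src1 < src2 }

// dst = 3 has incoming edges from 1, 2, 4 -> pairs (1,2), (1,4), (2,4)
println(pairs.count())

sc.stop()
```

Because the join is with the RDD itself, partitioning and caching it once up front means the shuffle cost is paid a single time rather than on each side of the join.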