How long are the count() steps taking? And how many partitions are pairs1 and triples initially divided into? You can check this with pairs1._jrdd.splits().size(), for example.
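For intuition, the reason a co-partitioned join can run without moving data can be modeled in plain Python. This is only a toy sketch, not Spark's implementation, and every name in it is made up: both datasets are bucketed with the same hash partitioner, so records sharing a key always land in the same bucket index, and each pair of buckets can be joined locally.

```python
def hash_partition(pairs, num_partitions):
    """Bucket (key, value) pairs by hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # The bucket index depends only on the key, so the same key
        # always maps to the same bucket for a fixed partition count.
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def local_join(buckets_a, buckets_b):
    """Join co-partitioned buckets pairwise; no cross-bucket lookups needed."""
    out = []
    for bucket_a, bucket_b in zip(buckets_a, buckets_b):
        lookup = dict(bucket_b)
        for key, value in bucket_a:
            if key in lookup:
                out.append((key, (value, lookup[key])))
    return out

pairs1 = [("a", 1), ("b", 2), ("c", 3)]
triples = [("a", 10), ("b", 20), ("d", 40)]

# Both sides use the same partitioner, so the join is purely bucket-local.
joined = local_join(hash_partition(pairs1, 4), hash_partition(triples, 4))
```

The point of the model: if the two RDDs are *not* partitioned with the same partitioner (or the partitioning step itself is the expensive part), you still pay a shuffle somewhere, which would explain the slowness.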
If you just need to count the number of distinct keys, is it faster if you do the following instead of groupByKey().count()?

g1 = pairs1.map(lambda (k, v): k).distinct().count()
g2 = triples.map(lambda (k, v): k).distinct().count()

Nick

On Mon, Apr 21, 2014 at 10:42 PM, Joe L <selme...@yahoo.com> wrote:
> g1 = pairs1.groupByKey().count()
> pairs1 = pairs1.groupByKey(g1).cache()
> g2 = triples.groupByKey().count()
> pairs2 = pairs2.groupByKey(g2)
>
> pairs = pairs2.join(pairs1)
>
> Hi, I want to implement hash-partitioned joining as shown above, but
> somehow it is taking very long to perform. As I understand it, the above
> join is only performed locally, right, since the RDDs are partitioned
> by key first? After we partition them, matching keys will reside on the
> same node. So isn't the join supposed to be fast once we partition by
> key? Thank you.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-is-slow-tp4539p4577.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.