How long are the count() steps taking? And how many partitions are pairs1 and
triples initially divided into? You can see this by doing
pairs1._jrdd.splits().size(), for example.
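Something like the following (a minimal sketch; it assumes pairs1 and
triples are the RDDs from your snippet) would print the partition counts
and time one of the count() steps:

import time

# number of partitions each RDD is currently divided into
print(pairs1._jrdd.splits().size())
print(triples._jrdd.splits().size())

# wall-clock time for one count() step
start = time.time()
pairs1.count()
print("count() took %.2f seconds" % (time.time() - start))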

If you just need to count the number of distinct keys, would it be faster
to do the following instead of groupByKey().count()?

g1 = pairs1.map(lambda (k,v): k).distinct().count()
g2 = triples.map(lambda (k,v): k).distinct().count()
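(Equivalently, pairs1.keys().distinct().count() should give the same
result.)

Also, if what you're after is a copartitioned join, note that
groupByKey() changes the values (it groups them into iterables), and
passing the group count in as the number of partitions is probably not
what you want. Here's a minimal sketch of what I think you're trying to
do, assuming pairs1 and triples are plain (key, value) RDDs; the
partition count of 100 is just a placeholder to tune for your cluster:

NUM_PARTITIONS = 100  # hypothetical value; tune to your data and cluster

# partitionBy() hash-partitions each RDD by key, so matching keys
# land in the same partition before the join.
p1 = pairs1.partitionBy(NUM_PARTITIONS).cache()
p2 = triples.partitionBy(NUM_PARTITIONS).cache()

pairs = p2.join(p1, NUM_PARTITIONS)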

Nick


On Mon, Apr 21, 2014 at 10:42 PM, Joe L <selme...@yahoo.com> wrote:

> g1 = pairs1.groupByKey().count()
> pairs1 = pairs1.groupByKey(g1).cache()
> g2 = triples.groupByKey().count()
> pairs2 = triples.groupByKey(g2)
>
> pairs = pairs2.join(pairs1)
>
> Hi, I want to implement a hash-partitioned join as shown above, but
> somehow it is taking very long to perform. As I understand it, the join
> above should run locally, since both RDDs are partitioned the same way;
> after partitioning, matching keys will reside on the same node. So isn't
> it supposed to be fast when we partition by keys? Thank you.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-is-slow-tp4539p4577.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
