Dear community,

I have a very large, flat RDD describing a link graph, something like:

a.com/page1 -> b.com/page2
a.com/page1 -> a.com/page5
b.com/page5 -> a.com/page3

I want to calculate PageRank (with GraphX) over in-domain links only, i.e. a
separate PageRank for the pages of each domain (a.com, b.com,
whatever.com, ...). The number of domains is in the 100 M range.

What I do: I filter the links (keeping only those whose source domain equals
the target domain), then group by domain. And here is the problem: some
domains have only 100 internal links, while others have 100 M. I can't just
run PageRank on every group in parallel after the groupBy, since that blows
up.
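For concreteness, here is a plain-Python sketch of the filter-and-group step (no Spark; the `domain()` helper and the in-memory list stand in for the RDD transformations, which would be `links.filter(...).groupBy(...)`):

```python
# Sketch of the in-domain filter + group-by-domain step, on plain tuples.
# In Spark this would be links.filter(...).groupBy(...); plain Python lists
# are used here so the logic is easy to follow.

def domain(url):
    """Return the domain part of a 'domain/path' style link anchor."""
    return url.split("/", 1)[0]

links = [
    ("a.com/page1", "b.com/page2"),
    ("a.com/page1", "a.com/page5"),
    ("b.com/page5", "a.com/page3"),
]

# Keep only in-domain links: source domain equals target domain.
in_domain = [(src, dst) for src, dst in links if domain(src) == domain(dst)]

# Group the surviving edges by their (shared) domain.
by_domain = {}
for src, dst in in_domain:
    by_domain.setdefault(domain(src), []).append((src, dst))
```

On the example data only `a.com/page1 -> a.com/page5` survives the filter, so `by_domain` has a single group for `a.com`.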

What I could do: store all unique domains in an RDD and process the links
domain by domain. But that would mean 100 M individual PageRank
computations; will this ever finish?
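For reference, each of those per-domain computations is itself cheap for small domains; a minimal PageRank power-iteration sketch (plain Python, the usual damping factor 0.85, not GraphX's implementation) over one domain's edge list could look like:

```python
def pagerank(edges, d=0.85, iters=20):
    """Minimal PageRank power iteration over a list of (src, dst) edges."""
    nodes = {n for e in edges for n in e}
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Base teleport probability for every node.
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for src in nodes:
            outs = out_links[src]
            if outs:
                share = d * rank[src] / len(outs)
                for dst in outs:
                    new_rank[dst] += share
            else:
                # Dangling node: spread its rank uniformly.
                for v in nodes:
                    new_rank[v] += d * rank[src] / n
        rank = new_rank
    return rank

# Example: a tiny 3-page domain forming a cycle.
ranks = pagerank([("p1", "p2"), ("p2", "p3"), ("p3", "p1")])
```

For a symmetric cycle like this the ranks converge to 1/3 each; the cost per domain is roughly iterations × edges, which is why the small domains are trivial and the 100 M-link domains dominate.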

Any suggestions on how to cope with this problem?

Thanks!
Adam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pagerank-Subgraphs-tp26453.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
