Dear community, I have a flat, really huge RDD describing a link graph, something like:
  a.com/page1 -> b.com/page2
  a.com/page1 -> a.com/page5
  b.com/page5 -> a.com/page3

I want to calculate PageRank (with GraphX) for in-domain links only, i.e. a separate PageRank over the pages of a.com, of b.com, of whatever.com. The number of domains is in the 100 M range.

What I do: I filter the edges (keeping only those where the source domain equals the target domain) and group them by domain. And here is the problem: some domains have only 100 internal links, others have 100 M. I can't just run PageRank on every group in parallel after the groupBy, since that blows up.

What I could do instead: store all unique domains in an RDD and process the links domain by domain. But that would mean 100 M individual PageRank computations -- will this ever end?

Any suggestions on how to cope with this problem?

Thanks!
Adam

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pagerank-Subgraphs-tp26453.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
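To make the setup concrete, here is a minimal sketch of the two steps described above -- the in-domain filter/groupBy and a per-domain PageRank -- written in plain Python over an in-memory edge list. This only mimics the Spark RDD operations (`filter`, `groupBy`) and GraphX's `pageRank`; all function names here are illustrative, not the actual Spark API:

```python
def domain(url):
    """Extract the domain from a URL-like string such as 'a.com/page1'."""
    return url.split("/", 1)[0]

def in_domain_groups(edges):
    """Keep only edges whose source and target share a domain, then group
    them by that domain (mirrors rdd.filter(...).groupBy(...))."""
    groups = {}
    for src, dst in edges:
        if domain(src) == domain(dst):
            groups.setdefault(domain(src), []).append((src, dst))
    return groups

def page_rank(edges, iters=20, d=0.85):
    """Minimal power-iteration PageRank over one domain's edge list,
    standing in for GraphX's pageRank on the per-domain subgraph."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {}
    for src, dst in edges:
        out_links.setdefault(src, []).append(dst)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node spreads its rank evenly over its outgoing links.
        contribs = {n: 0.0 for n in nodes}
        for src, dsts in out_links.items():
            for dst in dsts:
                contribs[dst] += ranks[src] / len(dsts)
        # Damped update: baseline (1 - d) / N plus received contributions.
        ranks = {n: (1 - d) / len(nodes) + d * contribs[n] for n in nodes}
    return ranks

edges = [("a.com/page1", "b.com/page2"),
         ("a.com/page1", "a.com/page5"),
         ("b.com/page5", "a.com/page3")]
groups = in_domain_groups(edges)   # only a.com has an internal link here
ranks = page_rank(groups["a.com"]) # a.com/page5 outranks a.com/page1
```

The sketch also makes the skew problem visible: `page_rank` has to be run once per key of `groups`, and the cost of each run is driven entirely by that one domain's edge count.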
