#2 is not a bug. Have a search through JIRA. The output is merely unnormalized. I think that is how (one of?) the original PageRank papers presents it.
On Thu, Jun 25, 2015, 7:39 AM Kelly, Terence P (HP Labs Researcher)
<terence.p.ke...@hp.com> wrote:

> Hi,
>
> Colleagues and I have found that the PageRank implementation bundled
> with Spark is incorrect in several ways. The code in question is in
> the Apache Spark 1.2 distribution's "examples" directory, in
> "SparkPageRank.scala".
>
> Consider the example graph presented in the colorful figure on the
> Wikipedia page for "PageRank"; below is an edge list representation,
> where vertex "A" is "1", "B" is "2", etc.:
>
> - - - - - begin
> 2 3
> 3 2
> 4 1
> 4 2
> 5 2
> 5 4
> 5 6
> 6 2
> 6 5
> 7 2
> 7 5
> 8 2
> 8 5
> 9 2
> 9 5
> 10 5
> 11 5
> - - - - - end
>
> Here is the output we get from Spark's PageRank after 100 iterations:
>
> B has rank: 1.9184837009011475.
> C has rank: 1.7807113697064196.
> E has rank: 0.24301279014684984.
> A has rank: 0.24301279014684984.
> D has rank: 0.21885362387494078.
> F has rank: 0.21885362387494078.
>
> There are three problems with this output:
>
> 1. Only six of the eleven vertices are represented in the output;
>    by definition, PageRank assigns a value to each vertex.
>
> 2. The values do not sum to 1.0; by definition, PageRank is a
>    probability vector with one element per vertex, and the sum of
>    the elements of the vector must be 1.0.
>
> 3. Vertices E and A receive the same PageRank, whereas other means of
>    computing PageRank, e.g., our own homebrew code and the method
>    used by Wikipedia, assign different values to these vertices. Our
>    own code has been compared against the PageRank implementation in
>    the "NetworkX" package, and the two agree.
>
> It looks like bug #1 is due to the Spark implementation of PageRank
> not emitting output for vertices with no incoming edges, and bug #3
> is due to the code not correctly handling vertices with no outgoing
> edges. Once #1 and #3 are fixed, normalization might be all that's
> required to fix #2 (maybe).
>
> We currently rely on the Spark PageRank for tests we're conducting;
> when do you think a fix might be ready?
>
> Thanks.
>
> -- Terence Kelly, HP Labs
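For what it's worth, the three issues are easy to see side by side in a small standalone power iteration. The sketch below is hypothetical illustration code (plain Python, not the Spark `SparkPageRank.scala` implementation): it initializes a rank entry for every vertex so vertices with no in-edges still appear in the output (#1), it redistributes the mass of dangling vertices such as A uniformly instead of dropping it (#3), and as a consequence the ranks stay a probability vector summing to 1.0 (#2).

```python
def pagerank(edges, n, d=0.85, iters=100):
    """Power-iteration PageRank over vertices 1..n with damping factor d."""
    # Every vertex gets an out-list entry, even with no in- or out-edges,
    # so every vertex appears in the result (fix for issue #1).
    out = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        out[u].append(v)

    rank = {v: 1.0 / n for v in range(1, n + 1)}
    for _ in range(iters):
        # Mass sitting on dangling vertices (no out-edges) is spread
        # uniformly; silently dropping it is what makes the ranks stop
        # summing to 1.0 (fixes issues #2 and #3).
        dangling = sum(rank[v] for v in out if not out[v])
        nxt = {v: (1 - d) / n + d * dangling / n for v in out}
        for u, targets in out.items():
            for v in targets:
                nxt[v] += d * rank[u] / len(targets)
        rank = nxt
    return rank

# The edge list from the email (A=1, B=2, ...).
edges = [(2, 3), (3, 2), (4, 1), (4, 2), (5, 2), (5, 4), (5, 6),
         (6, 2), (6, 5), (7, 2), (7, 5), (8, 2), (8, 5), (9, 2),
         (9, 5), (10, 5), (11, 5)]
r = pagerank(edges, 11)
```

With this handling, all eleven vertices are reported, the values sum to 1.0, and E and A receive different ranks, matching what NetworkX and the Wikipedia figure give for this graph.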