#2 is not a bug. Have a search through JIRA. The output is merely unnormalized. I think that is how (one of?) the original PageRank papers presents it.
On Thu, Jun 25, 2015, 7:39 AM Kelly, Terence P (HP Labs Researcher)
<terence.p.ke...@hp.com> wrote:

> Hi,
>
> Colleagues and I have found that the PageRank implementation bundled
> with Spark is incorrect in several ways. The code in question is in
> the Apache Spark 1.2 distribution's "examples" directory, in
> "SparkPageRank.scala".
>
> Consider the example graph presented in the colorful figure on the
> Wikipedia page for "PageRank"; below is an edge list representation,
> where vertex "A" is "1", "B" is "2", etc.:
>
> - - - - - begin
> 2 3
> 3 2
> 4 1
> 4 2
> 5 2
> 5 4
> 5 6
> 6 2
> 6 5
> 7 2
> 7 5
> 8 2
> 8 5
> 9 2
> 9 5
> 10 5
> 11 5
> - - - - - end
>
> Here is the output we get from Spark's PageRank after 100 iterations:
>
> B has rank: 1.9184837009011475.
> C has rank: 1.7807113697064196.
> E has rank: 0.24301279014684984.
> A has rank: 0.24301279014684984.
> D has rank: 0.21885362387494078.
> F has rank: 0.21885362387494078.
>
> There are three problems with this output:
>
> 1. Only six of the eleven vertices are represented in the output;
>    by definition, PageRank assigns a value to each vertex.
>
> 2. The values do not sum to 1.0; by definition, PageRank is a
>    probability vector with one element per vertex, and the sum of
>    the elements of the vector must be 1.0.
>
> 3. Vertices E and A receive the same PageRank, whereas other means of
>    computing PageRank, e.g., our own homebrew code and the method
>    used by Wikipedia, assign different values to these vertices. Our
>    own code has been compared against the PageRank implementation in
>    the "NetworkX" package, and the two agree.
>
> It looks like bug #1 is due to the Spark implementation of PageRank
> not emitting output for vertices with no incoming edges, and bug #3
> is due to the code not correctly handling vertices with no outgoing
> edges. Once #1 and #3 are fixed, normalization might be all that's
> required to fix #2 (maybe).
>
> We currently rely on the Spark PageRank for tests we're conducting;
> when do you think a fix might be ready?
>
> Thanks.
>
> -- Terence Kelly, HP Labs
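For what it's worth, the three issues are easy to see side by side in a small standalone power iteration. The sketch below is hypothetical illustration code (plain Python, not the Spark `SparkPageRank.scala` implementation): it initializes a rank entry for every vertex so vertices with no in-edges still appear in the output (#1), it redistributes the mass of dangling vertices such as A uniformly instead of dropping it (#3), and as a consequence the ranks stay a probability vector summing to 1.0 (#2).

```python
def pagerank(edges, n, d=0.85, iters=100):
    """Power-iteration PageRank over vertices 1..n with damping factor d."""
    # Every vertex gets an out-list entry, even with no in- or out-edges,
    # so every vertex appears in the result (fix for issue #1).
    out = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        out[u].append(v)

    rank = {v: 1.0 / n for v in range(1, n + 1)}
    for _ in range(iters):
        # Mass sitting on dangling vertices (no out-edges) is spread
        # uniformly; silently dropping it is what makes the ranks stop
        # summing to 1.0 (fixes issues #2 and #3).
        dangling = sum(rank[v] for v in out if not out[v])
        nxt = {v: (1 - d) / n + d * dangling / n for v in out}
        for u, targets in out.items():
            for v in targets:
                nxt[v] += d * rank[u] / len(targets)
        rank = nxt
    return rank

# The edge list from the email (A=1, B=2, ...).
edges = [(2, 3), (3, 2), (4, 1), (4, 2), (5, 2), (5, 4), (5, 6),
         (6, 2), (6, 5), (7, 2), (7, 5), (8, 2), (8, 5), (9, 2),
         (9, 5), (10, 5), (11, 5)]
r = pagerank(edges, 11)
```

With this handling, all eleven vertices are reported, the values sum to 1.0, and E and A receive different ranks, matching what NetworkX and the Wikipedia figure give for this graph.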