[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16271


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16271#discussion_r92621591
  
--- Diff: 
graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with 
LocalSparkContext {
   val resetProb = 0.15
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 1, 
resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 2, 
resetProb).vertices.cache()
+  val staticRanks1 = starGraph.staticPageRank(numIter = 2, 
resetProb).vertices
--- End diff --

Not really more robust since it has a sink and thus is still wrong pending 
SPARK-18847. But it is needed with the change to fully propagate the change in 
rank of source vertices in the first iteration as explained above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16271#discussion_r92578668
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala 
---
@@ -225,11 +225,11 @@ object PageRank extends Logging {
 ctx => ctx.sendToDst(ctx.srcAttr :* ctx.attr),
 (a : BV[Double], b : BV[Double]) => a :+ b, TripletFields.Src)
 
-  rankGraph = rankGraph.joinVertices(rankUpdates) {
-(vid, oldRank, msgSum) =>
-  val popActivations: BV[Double] = msgSum :* (1.0 - resetProb)
+  rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
--- End diff --

Is this a related but slightly separate fix?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16271#discussion_r92578753
  
--- Diff: 
graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with 
LocalSparkContext {
   val resetProb = 0.15
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 1, 
resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 2, 
resetProb).vertices.cache()
+  val staticRanks1 = starGraph.staticPageRank(numIter = 2, 
resetProb).vertices
--- End diff --

A few extra iterations makes the test more robust?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-13 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16271

[SPARK-18845][GraphX] PageRank has incorrect initialization value that 
leads to slow convergence

## What changes were proposed in this pull request?

Change the initial value in all PageRank implementations to be `1.0` 
instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of 
`joinVertices` so that source vertices get updated in each iteration. 

This seems to have been introduced a long time ago in 
https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90

With the exception of graphs with sinks (which currently give incorrect 
results see SPARK-18847) this gives faster convergence as the sum of ranks is 
already correct (sum or ranks should be number of vertices).

Convergence comparision benchmark for small graph: http://imgur.com/a/HkkZf
Code for benchmark: 
https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd

## How was this patch tested?

(corrected) existing unit tests and additional test that verifies against 
result of igraph and NetworkX on a loop with a source.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark pagerank-initial-value

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16271.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16271


commit 9149ca233722e3d70aefb223dfd6a16ee8dbf924
Author: Andrew Ray 
Date:   2016-12-12T18:34:49Z

fix

commit b145376d88b6f5e58e2b9d051d9c268a36b9f939
Author: Andrew Ray 
Date:   2016-12-13T15:51:06Z

fix initial value for grid graph independent calculation

commit d39d2f07ab1a1aadb24dbd67bbbe37400beaadb4
Author: Andrew Ray 
Date:   2016-12-13T16:25:15Z

use outer join so that sources are updated and fix reset probability for 
personalized

commit 7ea03a88a3d9caa0ab7a7e6e681b8bf00b5cc128
Author: Andrew Ray 
Date:   2016-12-13T16:36:10Z

fix star page rank test to account for sources getting updated in the first 
iteration which then changes the center in the second iteration




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org