[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355664#comment-15355664 ] Vasia Kalavri commented on FLINK-3879: -- You and I are the only Gelly component "shepherds", but that doesn't mean we are the only ones that can review Gelly PRs. Any committer can help :) I like the idea about improving our process. I went through the JIRAs last week and pinged a few people, released some, closed some, but we can certainly clean up more. Big +1 for roadmap => JIRA => implementation discussion => PR. Do you think we should add this to the wiki /contribution guidelines / start a discussion in the dev list? > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355292#comment-15355292 ] Greg Hogan commented on FLINK-3879: --- Hi [~vkalavri], good to have you back! I appreciate your confidence. Have I correctly surmised that you are the only other current reviewer of Gelly PRs? I also prefer code reviews, but the noted algorithms would not be available in the 1.1 release if I had not triple-washed and merged the code. There was no dissension on these tickets and all review comments were resolved. If we need to improve our process I'd suggest starting at the top by keeping a clean sheet of Jira tickets. Ideas should be reserved for the Gelly roadmap, and any Jira ticket should be ready for and hopefully initiate a conversation discussing implementation details. It should be decided upfront or as early as possible whether to accept or reject a feature or improvement. With HITSAlgorithm it was identified early in the review that performance would likely be better using GSA and that performance would be degraded due to duplicating edges. +1 for collaboration, +1 for conversations, +1 for code reviews > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353687#comment-15353687 ] ASF GitHub Bot commented on FLINK-3879: --- Github user vasia commented on the issue: https://github.com/apache/flink/pull/1967 Hey @greghogan, was there consensus regarding this change? I see the numbers, but did anyone review this PR? I've been offline for the past few days, and I now see that nobody reviewed #2160, #2079, #2067, #1997 either... I don't doubt that you have done a great job, but it is _always_ better to let someone review your code before you merge. We don't usually merge PRs without a +1 unless it is something trivial. I understand things move faster this way, but we are in a community and we should try to collaborate. Please, leave a comment next time you think a PR has stayed with no review for a long time or ping me personally if you want a 2nd pair of eyes on gelly stuff :) Thanks! > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353650#comment-15353650 ] ASF GitHub Bot commented on FLINK-3879: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1967 > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353421#comment-15353421 ] Greg Hogan commented on FLINK-3879: --- Using CombineHint.HASH the HITS timings dropped from/to: scale 10: 1,034 ms -> 1,106 ms scale 12: 1,115 ms -> 1,090 ms scale 14: 1,974 ms -> 1,608 ms scale 16: 5,843 ms -> 3,841 ms scale 18: 21,927 ms -> 13,345 ms scale 20: 93,488 ms -> 54,570 ms Impressive [~ggevay]! > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353348#comment-15353348 ] Greg Hogan commented on FLINK-3879: --- I had not realized pr1517 was defaulting to the old sort-based combine rather than the new hash-based combine. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346899#comment-15346899 ] Greg Hogan commented on FLINK-3879: --- [~vkalavri] here are the timings I get on an AWS c4.2xlarge using FLINK-3879 rebased to master and merged with pr1517 (hash-based combine). Each execution is performing 10 iterations. FLINK-3879 (HITS): $ for i in `seq 10 2 20` ; do echo ; echo $i ; ./bin/flink run -q -class org.apache.flink.graph.examples.HITS ~/flink-gelly-examples_2.10-1.1-SNAPSHOT.jar --input rmat --scale $i --output hash --algorithm HITS ; done Scale 10: ChecksumHashCode 0x018d7c49ca4a, count 902 Execution runtime: 1,034 ms Scale 12: ChecksumHashCode 0x058a6efc82d1, count 3349 Execution runtime: 1,115 ms Scale 14: ChecksumHashCode 0x15030ecf3188, count 12472 Execution runtime: 1,974 ms Scale 16: ChecksumHashCode 0x4f23492849eb, count 46826 Execution runtime: 5,843 ms Scale 18: ChecksumHashCode 0x0001267c806b8338, count 174010 Execution runtime: 21,927 ms Scale 20: ChecksumHashCode 0x000449bd0da45343, count 646203 Execution runtime: 93,488 ms FLINK-2044 (HITSAlgorithm): $ for i in `seq 10 2 20` ; do echo ; echo $i ; ./bin/flink run -q -class org.apache.flink.graph.examples.HITS ~/flink-gelly-examples_2.10-1.1-SNAPSHOT.jar --input rmat --scale $i --output hash --algorithm HITSAlgorithm ; done Scale 10: ChecksumHashCode 0x01c88b0818f0, count 902 Execution runtime: 761 ms Scale 12: Cluster retrieved ChecksumHashCode 0x069bcad3b322, count 3349 Execution runtime: 1,094 ms Scale 14: ChecksumHashCode 0x186c50950fea, count 12472 Execution runtime: 2,290 ms Scale 16: ChecksumHashCode 0x5b741faf30eb, count 46826 Execution runtime: 6,898 ms Scale 18: ChecksumHashCode 0x000153e520a8306c, count 174010 Execution runtime: 28,015 ms Scale 20: ChecksumHashCode 0x0004ed44e75c493a, count 646203 Execution runtime: 120,736 ms > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344286#comment-15344286 ] Greg Hogan commented on FLINK-3879: --- When I tested the performance of FLINK-3879 it was significantly faster than FLINK-2044, which is why I had thought we were keeping the latter as an example. It's likely that rewriting 2044 using GSA would close some of that gap but there will always be a penalty for duplicating vertices. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338592#comment-15338592 ] Vasia Kalavri commented on FLINK-3879: -- Hey [~greghogan], since FLINK-2044 was updated to provide convergence and return both scores, do we still need this issue? > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280826#comment-15280826 ] Greg Hogan commented on FLINK-3879: --- The current expectation for FLINK-2044 is that the user will have some intuition on how many iterations to run. If there is a specific best value then it should be hard-coded in the program; otherwise, we should provide guidance on how to select the number of iterations. I do not see discussion of a convergence threshold in the HITS paper, but borrowing this aspect of PageRank seems appropriate. A threshold of 1E-9 requires that the average score is changing by less than one part in a billion. That seems easier to conceptualize than setting the number of iterations. I think a benchmark is always good for context when comparing implementations. When does Scatter-Gather outperform GSA? I see that some algorithms cannot be implemented using GSA, but that is clearly not the case for these algorithms. The only delta iteration I am familiar with is Stephan's implementation of PageRank. Are you thinking of a similar modification to HITS? It should be straightforward to implement this and compare techniques for HITS and PageRank. Convergence should not be affected, but vertex scores below a given threshold will stop propagating deltas. https://gist.github.com/StephanEwen/2b1a4c9812ac46abc8f0 > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280246#comment-15280246 ] Vasia Kalavri commented on FLINK-3879: -- [~greghogan] - Do we agree that the PR for FLINK-2044 is now in good state and could be merged? Or would you rather benchmark this against it and go for the most performant one? - Gelly library methods: currently there are scatter-gather and GSA implementations for PageRank, Connected Components, and SSSP. We have these because GSA performs better for graphs with skewed degree distributions. In the Gelly docs-[iteration abstractions comparison|https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html#iteration-abstractions-comparison], we describe when GSA should be preferred over scatter-gather. Maybe we can make this more explicit. There is no Pregel implementation (only in examples). The {{GSATriangleCount}} library method has proved to be very inefficient and should be removed imo (I'll open a JIRA). - I'm not sure what you mean by "approximate HITS"? > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280134#comment-15280134 ] Greg Hogan commented on FLINK-3879: --- Implementations which merely showcase the use of Gelly graph models seem most appropriate as examples. Do we have examples of inputs which perform better as GSA vs SG vs Pregel? I am not finding any direct guidance in the documentation for a user looking to choose between duplicate library algorithms. FLINK-3879 will be faster unless one of the current models is extended or a new graph model is created to process out- and in- edges separately in the same iteration or to allow disabling operators on certain supersteps. An approximate HITS using delta iterations would be as easy to implement natively as with GSA. Before accepting such an implementation I would like to see evidence that performing more approximate iterations converges more quickly when compared with running fewer bulk iterations. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279918#comment-15279918 ] Vasia Kalavri commented on FLINK-3879: -- Gelly has multiple implementations for some algorithms to showcase how the different iteration abstractions can be used. Also, for some graph inputs an implementation might perform better than another (e.g. scatter-gather vs gsa). That doesn't mean we should add multiple implementations for all new algorithms :) Now regarding performance, I'm not quite sure that FLINK-3879 will perform better than FLINK-2044. I haven't looked at the PR in detail, but I saw that it uses a bulk iteration. That means that a new partial solution is generated in every iteration and we cannot take advantage of the asymmetric convergence (if any). > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277514#comment-15277514 ] GaoLun commented on FLINK-3879: --- Scatter-gather model divides original iteration into two parts. For performance, 3879 is better. Maybe it's good to keep both with different path. One for performance, another for being consistent with other algorithm implementations used same iteration model. :) > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276761#comment-15276761 ] Greg Hogan commented on FLINK-3879: --- That is a very "meta" question :) Gelly currently contains multiple implementations for Connected Components, PageRank, and Single Source Shortest Paths. Is the goal to choose one implementation for the library and keep alternate implementations in {{flink-gelly-examples}}? [~gallenvara_bg] has FLINK-2044 looking pretty good. Currently there is a limitation of the Scatter-Gather model which requires an implementation of HITS to waste half of the join operations. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276102#comment-15276102 ] Vasia Kalavri commented on FLINK-3879: -- Hey [~greghogan], [~gallenvara_bg], what's the plan with this issue and FLINK-2044? We now have 2 JIRAs and 2 implementations for the same algorithm. Is the intention to only keep one or both? Do they provide substantially different functionality? Thanks! > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275265#comment-15275265 ] GaoLun commented on FLINK-3879: --- yes, make sense. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275242#comment-15275242 ] GaoLun commented on FLINK-3879: --- Hub values are same but authority values have a little difference. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275240#comment-15275240 ] GaoLun commented on FLINK-3879: --- My: (1,0.847998304005088,0.0), (2,0.5299989400031799,0.5144957554275266), (3,0.0,0.8574929257125442) Yours: (1,0.8479983040050879,0.0), (2,0.5299989400031799,0.5240974256643347), (3,0.0,0.8516583167045438) > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275169#comment-15275169 ] Greg Hogan commented on FLINK-3879: --- What are 2044's authority scores after two iterations? > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275160#comment-15275160 ] GaoLun commented on FLINK-3879: --- Hi [~greghogan], the PR of FLINK-2044 has been updated and support returning both value now. And i have changed the normalization method from sum to square sum. I wrote a simple test for your implementation to compare the result with mine, but i find the result is different. For a simple graph: {{1->2, 1->3, 2->3}} with one iteration result : Mine: {{(1,0.8320502943378436,0.0), (2,0.554700196225229,0.4472135954999579), (3,0.0,0.894427190159)}} Yours: {{(1,0.8320502943378437,0.0), (2,0.5547001962252291,0.5144957554275265), (3,0.0,0.8574929257125441)}} We can calculate the hub/authority value manually, the result should be: {{(1, sqrt(9/13), 0.0), (2,sqrt(4/13), 1/sqrt(5)), (3, 0.0, 2/sqrt(5))}} which is a little different with yours. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274307#comment-15274307 ] Greg Hogan commented on FLINK-3879: --- I think FLINK-2044 can test for convergence if the last three scores are stored for each vertex. That would also make it trivial to return both scores. So, yes, performance. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274270#comment-15274270 ] Vasia Kalavri commented on FLINK-3879: -- So, it provides the same functionality as FLINK-2044, but it's more efficient? > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274254#comment-15274254 ] Vasia Kalavri commented on FLINK-3879: -- Hi [~greghogan], Can you please extend the description a bit? How is algorithm different than what FLINK-2044 proposes and what's the motivation for adding it to Gelly? Thanks! > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274225#comment-15274225 ] ASF GitHub Bot commented on FLINK-3879: --- GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/1967 [FLINK-3879] [gelly] Native implementation of HITS algorithm You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 3879_native_implementation_of_hits_algorithm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1967.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1967 commit 724a86b5b5e7e7a93392048d5842fe54df7c4bfe Author: Greg HoganDate: 2016-05-06T15:13:26Z [FLINK-3879] [gelly] Native implementation of HITS algorithm > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)