[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-29 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355664#comment-15355664
 ] 

Vasia Kalavri commented on FLINK-3879:
--

You and I are the only Gelly component "shepherds", but that doesn't mean we 
are the only ones that can review Gelly PRs. Any committer can help :)
I like the idea about improving our process. I went through the JIRAs last week 
and pinged a few people, released some, closed some, but we can certainly clean 
up more.
Big +1 for roadmap => JIRA => implementation discussion => PR.
Do you think we should add this to the wiki /contribution guidelines / start a 
discussion in the dev list?

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-29 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355292#comment-15355292
 ] 

Greg Hogan commented on FLINK-3879:
---

Hi [~vkalavri], good to have you back!

I appreciate your confidence. Have I correctly surmised that you are the only 
other current reviewer of Gelly PRs? I also prefer code reviews, but the noted 
algorithms would not be available in the 1.1 release if I had not triple-washed 
and merged the code. There was no dissension on these tickets and all review 
comments were resolved.

If we need to improve our process I'd suggest starting at the top by keeping a 
clean sheet of Jira tickets. Ideas should be reserved for the Gelly roadmap, 
and any Jira ticket should be ready for and hopefully initiate a conversation 
discussing implementation details. It should be decided upfront or as early as 
possible whether to accept or reject a feature or improvement. With 
HITSAlgorithm it was identified early in the review that performance would 
likely be better using GSA and that performance would be degraded due to 
duplicating edges.

+1 for collaboration, +1 for conversations, +1 for code reviews

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353687#comment-15353687
 ] 

ASF GitHub Bot commented on FLINK-3879:
---

Github user vasia commented on the issue:

https://github.com/apache/flink/pull/1967
  
Hey @greghogan,
was there consensus regarding this change? I see the numbers, but did 
anyone review this PR?
I've been offline for the past few days, and I now see that nobody reviewed 
#2160, #2079, #2067, #1997  either...
I don't doubt that you have done a great job, but it is _always_ better to 
let someone review your code before you merge. We don't usually merge PRs 
without a +1 unless it is something trivial. I understand things move faster 
this way, but we are in a community and we should try to collaborate.
Please, leave a comment next time you think a PR has stayed with no review 
for a long time or ping me personally if you want a 2nd pair of eyes on gelly 
stuff :)
Thanks!


> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353650#comment-15353650
 ] 

ASF GitHub Bot commented on FLINK-3879:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1967


> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-28 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353421#comment-15353421
 ] 

Greg Hogan commented on FLINK-3879:
---

Using CombineHint.HASH the HITS timings dropped from/to:

scale 10: 1,034 ms -> 1,106 ms
scale 12: 1,115 ms -> 1,090 ms
scale 14: 1,974 ms -> 1,608 ms
scale 16: 5,843 ms -> 3,841 ms
scale 18: 21,927 ms -> 13,345 ms
scale 20: 93,488 ms -> 54,570 ms

Impressive [~ggevay]!

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-28 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353348#comment-15353348
 ] 

Greg Hogan commented on FLINK-3879:
---

I had not realized pr1517 was defaulting to the old sort-based combine rather 
than the new hash-based combine.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-23 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346899#comment-15346899
 ] 

Greg Hogan commented on FLINK-3879:
---

[~vkalavri] here are the timings I get on an AWS c4.2xlarge using FLINK-3879 
rebased to master and merged with pr1517 (hash-based combine). Each execution 
is performing 10 iterations.

FLINK-3879 (HITS):

$ for i in `seq 10 2 20` ; do echo ; echo $i ; ./bin/flink run -q -class 
org.apache.flink.graph.examples.HITS 
~/flink-gelly-examples_2.10-1.1-SNAPSHOT.jar --input rmat --scale $i --output 
hash --algorithm HITS ; done

Scale 10:
ChecksumHashCode 0x018d7c49ca4a, count 902
Execution runtime: 1,034 ms

Scale 12:
ChecksumHashCode 0x058a6efc82d1, count 3349
Execution runtime: 1,115 ms

Scale 14:
ChecksumHashCode 0x15030ecf3188, count 12472
Execution runtime: 1,974 ms

Scale 16:
ChecksumHashCode 0x4f23492849eb, count 46826
Execution runtime: 5,843 ms

Scale 18:
ChecksumHashCode 0x0001267c806b8338, count 174010
Execution runtime: 21,927 ms

Scale 20:
ChecksumHashCode 0x000449bd0da45343, count 646203
Execution runtime: 93,488 ms

FLINK-2044 (HITSAlgorithm):

$ for i in `seq 10 2 20` ; do echo ; echo $i ; ./bin/flink run -q -class 
org.apache.flink.graph.examples.HITS 
~/flink-gelly-examples_2.10-1.1-SNAPSHOT.jar --input rmat --scale $i --output 
hash --algorithm HITSAlgorithm ; done

Scale 10:
ChecksumHashCode 0x01c88b0818f0, count 902
Execution runtime: 761 ms

Scale 12:
Cluster retrieved
ChecksumHashCode 0x069bcad3b322, count 3349
Execution runtime: 1,094 ms

Scale 14:
ChecksumHashCode 0x186c50950fea, count 12472
Execution runtime: 2,290 ms

Scale 16:
ChecksumHashCode 0x5b741faf30eb, count 46826
Execution runtime: 6,898 ms

Scale 18:
ChecksumHashCode 0x000153e520a8306c, count 174010
Execution runtime: 28,015 ms

Scale 20:
ChecksumHashCode 0x0004ed44e75c493a, count 646203
Execution runtime: 120,736 ms

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-22 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344286#comment-15344286
 ] 

Greg Hogan commented on FLINK-3879:
---

When I tested the performance of FLINK-3879 it was significantly faster than 
FLINK-2044, which is why I had thought we were keeping the latter as an 
example. It's likely that rewriting 2044 using GSA would close some of that gap 
but there will always be a penalty for duplicating vertices.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-06-19 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338592#comment-15338592
 ] 

Vasia Kalavri commented on FLINK-3879:
--

Hey [~greghogan], since FLINK-2044 was updated to provide convergence and 
return both scores, do we still need this issue?

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-11 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280826#comment-15280826
 ] 

Greg Hogan commented on FLINK-3879:
---

The current expectation for FLINK-2044 is that the user will have some 
intuition on how many iterations to run. If there is a specific best value then 
it should be hard-coded in the program; otherwise, we should provide guidance 
on how to select the number of iterations. I do not see discussion of a 
convergence threshold in the HITS paper, but borrowing this aspect of PageRank 
seems appropriate. A threshold of 1E-9 requires that the average score is 
changing by less than one part in a billion. That seems easier to conceptualize 
than setting the number of iterations.

I think a benchmark is always good for context when comparing implementations.

When does Scatter-Gather outperform GSA? I see that some algorithms cannot be 
implemented using GSA, but that is clearly not the case for these algorithms.

The only delta iteration I am familiar with is Stephan's implementation of 
PageRank. Are you thinking of a similar modification to HITS? It should be 
straightforward to implement this and compare techniques for HITS and PageRank. 
Convergence should not be affected, but vertex scores below a given threshold 
will stop propagating deltas.
  https://gist.github.com/StephanEwen/2b1a4c9812ac46abc8f0

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-11 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280246#comment-15280246
 ] 

Vasia Kalavri commented on FLINK-3879:
--

[~greghogan]
- Do we agree that the PR for FLINK-2044 is now in good state and could be 
merged? Or would you rather benchmark this against it and go for the most 
performant one?
- Gelly library methods: currently there are scatter-gather and GSA 
implementations for PageRank, Connected Components, and SSSP. We have these 
because GSA performs better for graphs with skewed degree distributions. In the 
Gelly docs-[iteration abstractions 
comparison|https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html#iteration-abstractions-comparison],
 we describe when GSA should be preferred over scatter-gather. Maybe we can 
make this more explicit.
There is no Pregel implementation (only in examples). The {{GSATriangleCount}} 
library method has proved to be very inefficient and should be removed imo 
(I'll open a JIRA).
- I'm not sure what you mean by "approximate HITS"?

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-11 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280134#comment-15280134
 ] 

Greg Hogan commented on FLINK-3879:
---

Implementations which merely showcase the use of Gelly graph models seem most 
appropriate as examples. Do we have examples of inputs which perform better as 
GSA vs SG vs Pregel? I am not finding any direct guidance in the documentation 
for a user looking to choose between duplicate library algorithms.

FLINK-3879 will be faster unless one of the current models is extended or a new 
graph model is created to process out- and in- edges separately in the same 
iteration or to allow disabling operators on certain supersteps.

An approximate HITS using delta iterations would be as easy to implement 
natively as with GSA. Before accepting such an implementation I would like to 
see evidence that performing more approximate iterations converges more quickly 
when compared with running fewer bulk iterations.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-11 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279918#comment-15279918
 ] 

Vasia Kalavri commented on FLINK-3879:
--

Gelly has multiple implementations for some algorithms to showcase how the 
different iteration abstractions can be used. Also, for some graph inputs an 
implementation might perform better than another (e.g. scatter-gather vs gsa). 
That doesn't mean we should add multiple implementations for all new algorithms 
:)
Now regarding performance, I'm not quite sure that FLINK-3879 will perform 
better than FLINK-2044. I haven't looked at the PR in detail, but I saw that it 
uses a bulk iteration. That means that a new partial solution is generated in 
every iteration and we cannot take advantage of the asymmetric convergence (if 
any).


> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-09 Thread GaoLun (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277514#comment-15277514
 ] 

GaoLun commented on FLINK-3879:
---

Scatter-gather model divides original iteration into two parts. For 
performance, 3879 is better. Maybe it's good to keep both with different path. 
One for performance, another for being consistent with other algorithm 
implementations used same iteration model. :)

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-09 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276761#comment-15276761
 ] 

Greg Hogan commented on FLINK-3879:
---

That is a very "meta" question :) Gelly currently contains multiple 
implementations for Connected Components, PageRank, and Single Source Shortest 
Paths. Is the goal to choose one implementation for the library and keep 
alternate implementations in {{flink-gelly-examples}}?

[~gallenvara_bg] has FLINK-2044 looking pretty good. Currently there is a 
limitation of the Scatter-Gather model which requires an implementation of HITS 
to waste half of the join operations.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-09 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276102#comment-15276102
 ] 

Vasia Kalavri commented on FLINK-3879:
--

Hey [~greghogan], [~gallenvara_bg],
what's the plan with this issue and FLINK-2044? We now have 2 JIRAs and 2 
implementations for the same algorithm.
Is the intention to only keep one or both? Do they provide substantially 
different functionality? Thanks!

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-07 Thread GaoLun (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275265#comment-15275265
 ] 

GaoLun commented on FLINK-3879:
---

yes, make sense.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-07 Thread GaoLun (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275242#comment-15275242
 ] 

GaoLun commented on FLINK-3879:
---

Hub values are same but authority values have a little difference.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-07 Thread GaoLun (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275240#comment-15275240
 ] 

GaoLun commented on FLINK-3879:
---

My: (1,0.847998304005088,0.0), (2,0.5299989400031799,0.5144957554275266), 
(3,0.0,0.8574929257125442)
Yours: (1,0.8479983040050879,0.0), (2,0.5299989400031799,0.5240974256643347), 
(3,0.0,0.8516583167045438)

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-07 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275169#comment-15275169
 ] 

Greg Hogan commented on FLINK-3879:
---

What are 2044's authority scores after two iterations?

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-07 Thread GaoLun (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275160#comment-15275160
 ] 

GaoLun commented on FLINK-3879:
---

Hi [~greghogan], the PR of FLINK-2044 has been updated and support returning 
both value now. And i have changed the normalization method from sum to square 
sum.
I wrote a simple test for your implementation to compare the result with mine, 
but i find the result is different.
For a simple graph: {{1->2, 1->3, 2->3}} with one iteration
result :
Mine:  {{(1,0.8320502943378436,0.0), (2,0.554700196225229,0.4472135954999579), 
(3,0.0,0.894427190159)}}
Yours: {{(1,0.8320502943378437,0.0), (2,0.5547001962252291,0.5144957554275265), 
(3,0.0,0.8574929257125441)}}
We can calculate the hub/authority value manually, the result should be:
{{(1, sqrt(9/13), 0.0), (2,sqrt(4/13), 1/sqrt(5)), (3, 0.0, 2/sqrt(5))}}
which is a little different with yours.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274307#comment-15274307
 ] 

Greg Hogan commented on FLINK-3879:
---

I think FLINK-2044 can test for convergence if the last three scores are stored 
for each vertex. That would also make it trivial to return both scores. So, 
yes, performance.

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274270#comment-15274270
 ] 

Vasia Kalavri commented on FLINK-3879:
--

So, it provides the same functionality as FLINK-2044, but it's more efficient?

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274254#comment-15274254
 ] 

Vasia Kalavri commented on FLINK-3879:
--

Hi [~greghogan],
Can you please extend the description a bit? How is algorithm different than 
what FLINK-2044 proposes and what's the motivation for adding it to Gelly?
Thanks!

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274225#comment-15274225
 ] 

ASF GitHub Bot commented on FLINK-3879:
---

GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/1967

[FLINK-3879] [gelly] Native implementation of HITS algorithm



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 
3879_native_implementation_of_hits_algorithm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1967


commit 724a86b5b5e7e7a93392048d5842fe54df7c4bfe
Author: Greg Hogan 
Date:   2016-05-06T15:13:26Z

[FLINK-3879] [gelly] Native implementation of HITS algorithm




> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)