[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

2015-09-14 Thread andralungu
Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/1124#issuecomment-140131099
  
Hi @vasia ,

There was no clear rejection in the discussion, so I thought I'd try my 
luck :)

This technique could be enabled by using a flag only if we introduce a node 
split version of all the library methods. Otherwise, for newly written code, 
users would still have to pass a combine function according to their algorithm. 
They'd also have to pass the threshold that differentiates high-degree vertices 
from low-degree vertices. What is more, they still need to split and aggregate 
before and after each step.

These are aspects that cannot be guessed and that highly depend on the 
algorithm. 

Also, what I had in mind regarding this PR was that we would discuss and  
try to improve the code a bit, rather than immediately discarding it :(


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

2015-09-14 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1124#issuecomment-140146467
  
Sure, I'm not saying you shouldn't have opened this PR. I'm trying to save 
you some time.

So, my opinion is that in order for this to be useful, we need 2 things:
- Understand the performance implications of the method. When is it 
beneficial? What's the memory overhead? What's the pre-processing overhead? How 
does it depend on the input? What if I give a wrong threshold? etc.
- Make this as automatic as possible. Ideally, the user would only have to 
give the combiner function and the threshold. Splitting and merge should be 
handled internally. Actually, we could probably find ways to automatically 
determine the threshold, too, based on the graph, the current load and the 
available memory.

These are not easy tasks and will need a lot of work. That's why I'm 
proposing to link to the current state of this method from the Gelly docs, so 
that we can get feedback. We could create something like a "research projects 
on Gelly" page, where we link to work in progress from our roadmap tasks.

That's just my view, but let's see what other people think :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

2015-09-14 Thread andralungu
Github user andralungu closed the pull request at:

https://github.com/apache/flink/pull/1124


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

2015-09-13 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1124#issuecomment-139959564
  
Hi @andralungu,

I was under the impression that we never really reached consensus in the 
mailing list thread regarding this addition.

I definitely think we need to handle skewed graphs and you've done great 
work, but I wouldn't add this to Gelly at its current state.

- This is a very recent method that has not been tested thoroughly (apart 
from the experiments in your thesis work). Its benefits and overheads are not 
yet well understood.
- The API certainly needs rethinking. This should be a transparent method 
that would be easy to activate with a flag/option. Right now, it seems to me 
that it's too complicated to use and can easily allow erroneous implementations.

For now I would suggest that we keep this in your personal repository and 
we link to it from the Gelly documentation as additional/experimental feature. 
After we have better understanding of the technique and we have thought of a 
nicer API, we can reconsider adding it. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

2015-09-12 Thread andralungu
GitHub user andralungu opened a pull request:

https://github.com/apache/flink/pull/1124

[FLINK-2661] Added Node Splitting Methods

Social media graphs, citation networks or even protein networks have a 
common property: their degree distribution follows a power-law curve. This 
structure raises challenges to both vertex-centric and GSA/GAS models because 
they uniformly process vertices, regardless of their degree distribution. This 
leads to large execution time stalls: vertices wait for skewed nodes to finish 
computation [synchronous]. 

This PR aims to diminish the impact of high-degree nodes by proposing four 
main functions: `determinieSkewedVertices`, `treeDeAggregate` (splits a node 
into subnodes, recursively, in levels), `propagateValuesToSplitVertices` 
(useful when the algorithm performs more than one superstep), `treeAggregate` 
(brings the graph back to its initial state). 

These functions modify a graph at a high-level, making its degree 
distribution more uniform. The method does not need any special partitioning or 
runtime modification and (for skewed networks and computationally intensive 
algorithms) can speed up the run time by a factor of two. 

I added an example: NodeSplittingJaccardSimilarityMeasure, for which I 
needed to split the overall sequence of operations to two functions to be able 
to test the result. Calling the entire main method would have resulted in the 
"Two few memory segments etc" exception - too many operations called within one 
test, in other words. 

For more info, please consult the additional entry in the documentation. 

If we reach a common point and this PR gets merged, I will also follow 
@fhueske 's suggestion from the mailing list - adding a Split version of each 
of the library methods to allow users to verify whether their run time 
improves. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andralungu/flink splitJaccardFlink

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1124


commit b02d0917edcd5f3c8846fe01044afd7444a58c08
Author: Andra Lungu 
Date:   2015-09-12T10:25:20Z

[FLINK-2661] Added Node Splitting Methods

[FLINK-2661] Minor modifications in the docs

[FLINK-2661] pdf to png




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---