[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-15 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150869#comment-13150869
 ] 

Hudson commented on GIRAPH-11:
--

Integrated in Giraph-trunk-Commit #34 (See 
[https://builds.apache.org/job/Giraph-trunk-Commit/34/])
GIRAPH-88: Message count not updated properly after GIRAPH-11. (aching)

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1202455
Files : 
* /incubator/giraph/trunk/CHANGELOG
* 
/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
* 
/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java
* 
/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java


 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, 
 GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-14 Thread Jakob Homan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150095#comment-13150095
 ] 

Jakob Homan commented on GIRAPH-11:
---

I think this is ready to go.  Avery, just out of curiosity, beyond the MR 
unittests, have you run any test vertices on this?  

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, 
 GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-14 Thread Avery Ching (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150097#comment-13150097
 ] 

Avery Ching commented on GIRAPH-11:
---

Ran local tests, MR tests, and ran the page rank benchmark with 100 m vertices. 
 No doubt we'll need to optimize more, but this needed to get done first.

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, 
 GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-14 Thread Jakob Homan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150099#comment-13150099
 ] 

Jakob Homan commented on GIRAPH-11:
---

+1

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, 
 GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-14 Thread Hyunsik Choi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150110#comment-13150110
 ] 

Hyunsik Choi commented on GIRAPH-11:


I also think that it's great at this moment.

+1

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, 
 GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-13 Thread Hyunsik Choi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149412#comment-13149412
 ] 

Hyunsik Choi commented on GIRAPH-11:


Avery, 

I'm sorry for delaying the review. Now, I'm digging your patch. 
That looks great! Based on this work, we can consider some advanced graph 
partitioner based on the number of edge-cuts on graph partitions.

I need about one more day for more investigation because the patch is somewhat 
complicated for me :) 

Besides, for the deeper review, I would like to execute the some tests and 
trace them. Your patch needs the rebase. Could you rebase the patch?

Thank you :)

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-13 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149468#comment-13149468
 ] 

jirapos...@reviews.apache.org commented on GIRAPH-11:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/
---

(Updated 2011-11-14 06:56:19.251685)


Review request for giraph.


Changes
---

Updated the diff as per Hyunsik's request to build against recent trunk 
changes.  While I was waiting I added some fixed and additions as well.

Upgrade ZooKeeper to 3.3.3 from 3.3.1.

Fixed bug in PseudoRandomVertexInputFormat.java where the edges are not fully 
added (hasEdge is not the right place to look for the edge).

Fixed bug in BasicRPCCommunications when putting to a local inPartitionMap

Added counter for last checkpointed superstep

Master should refresh the progress every 60 seconds while waiting for workers 
to ensure that the job isn't killed

Fixed bugs in vertexCounter, finishedVertexCoutner, edgeCounter, and 
sentMessages counter not resetting every update (just cumultatively being 
added).

Add additional helpful status messages for debugging.

Turned off speculative execution for Giraph (not a good idea).

Added analysis of the partition balancing for debugging


Summary
---

Warning: This is a very large change!

Vertex ranges no longer exist.  A generic partitioner handles the
division of vertex ids to partitions.  As a default, there is a
HashPartitioner and a HashRangePartitioner that will use the hashCode
of a Java object to decide which partition to place the vertex.
Developers can write their own algorithm to determine how to change
the partitioning as well as implement the assignment of partitions to
workers.  All vertices loaded from the input split are sent to the
owner of the partition rather than loaded locally.  This eliminates the
constraint that the vertices must be ordered in the input split.

The checkpoint format has been changed to suit the new partition
style.  Checkpoints are now a lot simpler.  The master will assign
partitions and the workers will only load their own partitions from
the checkpoint.

Unfortunately, the vertex range implementation was baked into almost
every aspect of the code (hence the ridiculous size of this diff).
But now it should be flexible to support several different graph
partitioning schemes (i.e. hash-based, hash-ranged-based, and for
special cases, fully ranged-based).

Sorry for the long delay, but this way pretty involved.


This addresses bug GIRAPH-11.
https://issues.apache.org/jira/browse/GIRAPH-11


Diffs (updated)
-

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/pom.xml 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java
 1201607 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java
 PRE-CREATION 
  

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-13 Thread Hyunsik Choi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149471#comment-13149471
 ] 

Hyunsik Choi commented on GIRAPH-11:


Thank you for rebase.

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-10 Thread Avery Ching (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147916#comment-13147916
 ] 

Avery Ching commented on GIRAPH-11:
---

Sooo...anyone care to look at this?  I'd really like to get it since the old 
graph partitioning sucks.  One we have it in and test a bit, hopefully we can 
cut a release now..

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-10 Thread Hyunsik Choi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148187#comment-13148187
 ] 

Hyunsik Choi commented on GIRAPH-11:


That's a huge patch :)
I have just started to explore your patch.
I will leave some comments (maybe tomorrow).

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-09 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146944#comment-13146944
 ] 

jirapos...@reviews.apache.org commented on GIRAPH-11:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/
---

Review request for giraph.


Summary
---

Warning: This is a very large change!

Vertex ranges no longer exist.  A generic partitioner handles the
division of vertex ids to partitions.  As a default, there is a
HashPartitioner and a HashRangePartitioner that will use the hashCode
of a Java object to decide which partition to place the vertex.
Developers can write their own algorithm to determine how to change
the partitioning as well as implement the assignment of partitions to
workers.  All vertices loaded from the input split are sent to the
owner of the partition rather than loaded locally.  This eliminates the
constraint that the vertices must be ordered in the input split.

The checkpoint format has been changed to suit the new partition
style.  Checkpoints are now a lot simpler.  The master will assign
partitions and the workers will only load their own partitions from
the checkpoint.

Unfortunately, the vertex range implementation was baked into almost
every aspect of the code (hence the ridiculous size of this diff).
But now it should be flexible to support several different graph
partitioning schemes (i.e. hash-based, hash-ranged-based, and for
special cases, fully ranged-based).

Sorry for the long delay, but this way pretty involved.


This addresses bug GIRAPH-11.
https://issues.apache.org/jira/browse/GIRAPH-11


Diffs
-

  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
 1196639 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java
 1196639 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java
 1196639 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java
 1196639 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
 1196639 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
 1198972 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
 1199643 
  
http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java
 PRE-CREATION 

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-09 Thread Jake Mannix (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147194#comment-13147194
 ] 

Jake Mannix commented on GIRAPH-11:
---

wow, there's a monster diff!  I'm glad I don't have any outstanding patches to 
get clobbered!

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-11-09 Thread Avery Ching (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147226#comment-13147226
 ] 

Avery Ching commented on GIRAPH-11:
---

I started it so long ago that the merge was brutal for me.  =)

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
 Attachments: GIRAPH-11.diff


 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-16 Thread Severin Corsten (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13106636#comment-13106636
 ] 

Severin Corsten commented on GIRAPH-11:
---

yes, it's much clearer. If I can help you, just give me some task. I am not 
quite sure to what extent my code has the right quality for Giraph bud I will 
do my best.


 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching

 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-16 Thread Avery Ching (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13106839#comment-13106839
 ] 

Avery Ching commented on GIRAPH-11:
---

Hi Severin, as far as getting started, feel free to take a look at any of the 
open issues 
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+GIRAPH+AND+status+%3D+Open+ORDER+BY+priority+DESCmode=hide)
 that are unassigned.  Also, if you think of any interesting features to add, 
file a JIRA.  We'd be happy to have you contribute =).

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching

 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-09 Thread Severin Corsten (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13101590#comment-13101590
 ] 

Severin Corsten commented on GIRAPH-11:
---

What hashing algorithm do you want to use for the hash-partitioning? Just a 
hash of the vertexid or a more complex think or is it the users choice?

How do you want to solve the messaging between the vertices when using hash 
partitioning? Will you store a Map with Vertex - Worker or provide the hash 
algorithem to workers, so that they can identify the destination worker by 
themself? 

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching

 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-06 Thread Dan Brickley (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13097840#comment-13097840
 ] 

Dan Brickley commented on GIRAPH-11:


Is there more detail somewhere on how 'range based' works?

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching

 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-06 Thread Avery Ching (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13098098#comment-13098098
 ] 

Avery Ching commented on GIRAPH-11:
---

I'm going to assume you're asking about the current partitioning.  If I'm 
wrong, I'll address what we plan to do in the future.  The current partitioning 
is implemented by assuming that the input splits are sorted globally (i.e. two 
input split of {A, B, C} {D, E}).  It will break the input splits into vertex 
ranges where the boundaries will not change.  These vertex ranges can be passed 
around the workers via several different balancers.  The balancer can be set 
via setVertexRangeBalancerClass() from GiraphJob or with the right 
configuration parameter (giraph.vertexRangeBalancerClass).  We have some 
implementations for a static balancer (no vertex movement, default), and an 
auto balancer (configurable to balance based on vertices or edges).  You're 
free to implement your own as well.  Hope that answers some of the questions, 
let me know if you have more.

 Improve the graph distribution of Giraph
 

 Key: GIRAPH-11
 URL: https://issues.apache.org/jira/browse/GIRAPH-11
 Project: Giraph
  Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching

 Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
  If the user data is not sorted by the vertex id, they must first run a 
 MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
 inconvenient.
 Giraph graph partitioning is currently range based and there are some 
 advantages and disadvantages of this approach.  The proposal of this JIRA 
 would be to allow for both range and hash based partitioning and provide more 
 flexibility to the user.
 Design goals for the graph distribution:
 * Allow vertices to be unordered or unordered
 * Ability to repartition
 * Select the partitioning scheme based on user needs (i.e. hash or range 
 based)
 * Ability to provide user-specific hints about partitions
 Hash-based partitioning
 * Good vertex balancing across ranges for random data
 * Bad at vertex id locality
 Range-based partitioning
 * Good at vertex id locality
 * Ability to split ranges easily
 * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira