[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228136#comment-14228136 ] Takeshi Yamamuro commented on SPARK-1987: - What is the status of this patch? This is related to a issue I created (https://issues.apache.org/jira/browse/SPARK-4646). I refactored this patch based on my patch, it is as follows: https://github.com/maropu/spark/commit/77e34424a5e6cf2bfd6300ab35f329bdaba6e775 Thanks :) More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228143#comment-14228143 ] Larry Xiao commented on SPARK-1987: --- [~maropu] I think it needs slight change in build system. I see your patch, cool idea, didn't know about timsort before, and your code looks very clear. :) More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228208#comment-14228208 ] Takeshi Yamamuro commented on SPARK-1987: - Thanks for your review! :)) What's the change in the system? Anyway, if no problem, I'll send PR. Thanks, again. takeshi More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138970#comment-14138970 ] Apache Spark commented on SPARK-1987: - User 'larryxiao' has created a pull request for this issue: https://github.com/apache/spark/pull/2446 More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14099476#comment-14099476 ] Larry Xiao commented on SPARK-1987: --- ok. I understand. I'll try to implement it More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088953#comment-14088953 ] Larry Xiao commented on SPARK-1987: --- Hi Ankur What do you mean by custom sort routine? and parallel arrays? I understand the overhead of JVM objects, so do you want to use 3 separate primitivevector of srcID, dstId and Attr? I think I can implement EdgePartitionBuilder using three arrays. Some concern: Will this harm locality? (I also noticed in EdgePartition, srcID, dstId and Attr are stored in three arrays) BTW. I came across Storage Strategies for Collections in Dynamically Typed Languages, I think maybe this can be solved in JVM. More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org