[ https://issues.apache.org/jira/browse/SPARK-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979070#comment-14979070 ]

Glenn Strycker commented on SPARK-11387:
----------------------------------------

This ticket may describe one particular implementation idea that would address 
the general request for skewed joins.

> minimize shuffles during joins by using existing partitions and bundling 
> messages
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-11387
>                 URL: https://issues.apache.org/jira/browse/SPARK-11387
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Glenn Strycker
>
> Currently an RDD join in Spark requires repartitioning by the join key (for 
> large RDDs that cannot use broadcast).
> This is very bad for highly skewed data, as every row containing a particular 
> key will end up on one node.
> Additionally, repartitioning is expensive, and the existing partitioning 
> scheme may have been optimized to minimize message passing.  For example, 
> perhaps an RDD is an edge list for a graph, but a user has already 
> partitioned this data by a community structure or connected components, 
> ensuring that similar edges are on the same partition.  Using a join 
> operation to perform message passing will require repartitioning the edge 
> list by the first or second vertex in the edge as a key.
> Instead of repartitioning and shuffling, could messages across partitions be 
> "bundled" together and passed once, almost like a broadcast operation?
> Essentially the request here is to treat ALL RDDs of any size as 
> broadcast-capable: each partition would be broadcast one at a time and the 
> results aggregated.  It would be up to the user to optimize the partitioning 
> to minimize the between-partition message-passing volume.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
