Glenn Strycker created SPARK-11387:
--------------------------------------

             Summary: minimize shuffles during joins by using existing 
partitions and bundling messages
                 Key: SPARK-11387
                 URL: https://issues.apache.org/jira/browse/SPARK-11387
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Glenn Strycker


Currently an RDD join in Spark requires repartitioning by the join key (for 
large RDDs that cannot use broadcast).

This is very bad for highly skewed data, as every row containing a particular 
key will end up on one node.

Additionally, repartitioning is expensive, and the existing partitioning scheme 
may have been optimized to minimize message passing.  For example, perhaps an 
RDD is an edge list for a graph, but a user has already partitioned this data 
by a community structure or connected components, ensuring that similar edges 
are on the same partition.  Using a join operation to perform message passing 
will require repartitioning the edge list by the first or second vertex in the 
edge as a key.

Instead of repartitioning and shuffling, could messages across partitions be 
"bundled" together and passed once, almost like a broadcast operation?

Essentially the request here is to treat ALL RDDs of any size as 
broadcast-capable, and each partition would be broadcast one and at a time and 
the results aggregated.  It would be up to the user to optimize the 
partitioning to minimize the between-partition message passing volume.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to