Glenn Strycker created SPARK-11004:
--------------------------------------

             Summary: MapReduce Hive-like join operations for RDDs
                 Key: SPARK-11004
                 URL: https://issues.apache.org/jira/browse/SPARK-11004
             Project: Spark
          Issue Type: New Feature
            Reporter: Glenn Strycker


Could a feature be added to Spark that would use disk-only MapReduce operations 
for the very largest RDD joins?

MapReduce is able to handle incredibly large table joins in a stable, 
predictable way, with graceful failure and recovery.  I have applications that 
can join 2 tables without error in Hive, but the same tables, when converted 
into RDDs, cannot be joined in Spark (on the same cluster; I have experimented 
with all of the memory configurations, persisting to disk, checkpointing, etc., 
and the RDDs are simply too big for Spark on my cluster).

So, Spark is usually able to handle fairly large RDD joins, but occasionally 
runs into problems when the tables are just too big (e.g. the notorious 2GB 
shuffle-block limit, memory problems, etc.).  There are so many parameters to 
adjust (number of partitions, number of cores, memory per core, etc.) that it 
is difficult to guarantee stability on a shared cluster (say, running YARN) 
alongside other jobs.

Could a feature be added to Spark that would use disk-only MapReduce commands 
to do very large joins?

That is, instead of myRDD1.join(myRDD2), we would have a special operation 
myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run the 
join as a MapReduce job, and then convert the results back into a standard RDD 
(a rough sketch of what such an API could look like is included below).
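To make the proposal concrete, here is a minimal, purely illustrative sketch in 
Scala of what such a helper could look like. The names MapReduceJoinSketch, 
MapReduceJoinOps, mapReduceJoin, and the numPartitions parameter are all 
assumptions for this ticket, not an existing Spark API, and the body does not 
actually hand the work off to MapReduce; it only approximates the disk-only 
behavior with existing public RDD primitives (DISK_ONLY persistence, 
checkpointing, and a high partition count):

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative sketch only: not an existing Spark API, and no actual
// MapReduce job is launched here.
object MapReduceJoinSketch {

  implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {

    def mapReduceJoin[W](other: RDD[(K, W)],
                         numPartitions: Int): RDD[(K, (V, W))] = {
      // Spill both inputs to disk and cut their lineage before the
      // expensive shuffle (requires sc.setCheckpointDir to have been set).
      self.persist(StorageLevel.DISK_ONLY)
      other.persist(StorageLevel.DISK_ONLY)
      self.checkpoint()
      other.checkpoint()

      // In the actual proposal this step would be handed off to a
      // MapReduce job; here it is a plain RDD join over many partitions,
      // to keep individual shuffle blocks well under the 2GB limit.
      self.join(other, numPartitions)
    }
  }
}

// Hypothetical usage:
//   import MapReduceJoinSketch._
//   val joined = myRDD1.mapReduceJoin(myRDD2, numPartitions = 2000)
{code}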

This might add stability for Spark jobs that deal with extremely large data, 
and enable developers to mix and match Spark and MapReduce operations in the 
same program, rather than writing Hive scripts and stringing together separate 
Spark and MapReduce programs, which incurs very large overhead converting RDDs 
to Hive tables and back again.

Memory-level operations are where most of Spark's speed gains lie, but 
sometimes going disk-only may help with stability!



