[jira] [Issue Comment Deleted] (SPARK-5997) Increase partition count without performing a shuffle

Josiah Berkebile (JIRA) Mon, 12 Mar 2018 08:38:15 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Josiah Berkebile updated SPARK-5997:
------------------------------------
    Comment: was deleted

(was: This functionality can be useful when re-partitioning to output to a 
storage system like HDFS that doesn't handle large amounts of small files well. 
 I've had a few Spark applications I've written where I've had to fan-out the 
number of partitions to reduce stress on the cluster for heavy calculations, 
and then reduce that number back down to one partition per executor to reduce 
the number of output files to HDFS in order to keep the block-count down.

However, I can't think of a scenario where it would make sense to reduce the 
number of partitions to be less than the number of executors.  If someone needs 
to do this, I think it would be more appropriate to schedule a job to sweep-up 
after Spark and concatenate its output into a single file.)

> Increase partition count without performing a shuffle
> -----------------------------------------------------
>
>                 Key: SPARK-5997
>                 URL: https://issues.apache.org/jira/browse/SPARK-5997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Andrew Ash
>            Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-5997) Increase partition count without performing a shuffle

Reply via email to