[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395408#comment-16395408 ]

Josiah Berkebile commented on SPARK-5997:
-----------------------------------------

It may be worth mentioning that I've had to do this for several NLP-related Spark jobs because of the amount of RAM consumed building out the parse tree. Increasing the partition count in proportion to the total row count, so that each partition held roughly 'X' rows, both increased parallelism across the cluster and reduced pressure on RAM/heap utilization. In these scenarios, exact balance across the partitions isn't critical, so performing a shuffle just to maintain balance is detrimental to overall job performance.

> Increase partition count without performing a shuffle
> -----------------------------------------------------
>
>          Key: SPARK-5997
>          URL: https://issues.apache.org/jira/browse/SPARK-5997
>      Project: Spark
>   Issue Type: Improvement
>   Components: Spark Core
>     Reporter: Andrew Ash
>     Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the
> user has the ability to choose whether or not to perform a shuffle. However,
> when increasing partition count there is no option of whether to perform a
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition
> enough that {{.toLocalIterator}} has significantly reduced memory pressure on
> the driver, as it loads one partition at a time into the driver.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
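The behavior the ticket asks for -- fanning each existing partition out into smaller local slices rather than shuffling records across the cluster -- can be sketched in plain Python. This is not Spark API, just an illustration of the idea; the function name and shapes are hypothetical:

```python
import math

def split_partitions(partitions, target_count):
    """Increase partition count without a shuffle: each parent partition
    is sliced locally into contiguous child partitions, so no record ever
    moves between parents (no network transfer on a real cluster)."""
    k = math.ceil(target_count / len(partitions))  # children per parent
    children = []
    for part in partitions:
        step = math.ceil(len(part) / k) or 1  # slice size within a parent
        for i in range(0, len(part), step):
            children.append(part[i:i + step])
    return children

# Two skewed parent partitions fan out into four children. Note the
# children inherit whatever imbalance the parents had -- exact balance
# would require precisely the shuffle this ticket wants to avoid.
parents = [[1, 2, 3, 4, 5, 6], [7, 8]]
print(split_partitions(parents, 4))  # [[1, 2, 3], [4, 5, 6], [7], [8]]
```

This also illustrates the comment above: when the goal is only to cap per-partition size (roughly 'X' rows each), locality-preserving splitting is enough, and the residual skew is an acceptable trade for skipping the shuffle.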
[jira] [Issue Comment Deleted] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josiah Berkebile updated SPARK-5997:
------------------------------------
    Comment: was deleted

(was: This functionality can be useful when repartitioning output for a storage system like HDFS that doesn't handle large numbers of small files well. In a few Spark applications I've written, I've had to fan out the partition count to reduce stress on the cluster during heavy calculations, and then reduce it back down to one partition per executor to cut the number of output files written to HDFS and keep the block count down. However, I can't think of a scenario where it makes sense to reduce the partition count below the number of executors; if someone needs that, it would be more appropriate to schedule a job that sweeps up after Spark and concatenates its output into a single file.)
[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395405#comment-16395405 ]

Josiah Berkebile commented on SPARK-5997:
-----------------------------------------

This functionality can be useful when repartitioning output for a storage system like HDFS that doesn't handle large numbers of small files well. In a few Spark applications I've written, I've had to fan out the partition count to reduce stress on the cluster during heavy calculations, and then reduce it back down to one partition per executor to cut the number of output files written to HDFS and keep the block count down. However, I can't think of a scenario where it makes sense to reduce the partition count below the number of executors; if someone needs that, it would be more appropriate to schedule a job that sweeps up after Spark and concatenates its output into a single file.