[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395408#comment-16395408
 ] 

Josiah Berkebile commented on SPARK-5997:
-

Maybe it's worth mentioning that I've had to do this for several NLP-related 
Spark jobs because of the amount of RAM it consumes to build-out the parsing 
tree.  Increasing the partition count proportional to the total row count so 
that each partition had roughly 'X' number of rows both increased parallelism 
over the cluster and also allowed me to reduce pressure on the RAM/Heap 
utilization.  In these sorts of scenarios, exact balance across the partitions 
isn't of critical importance, so performing a shuffle just to maintain balance 
is detrimental to the overall job performance.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josiah Berkebile updated SPARK-5997:

Comment: was deleted

(was: This functionality can be useful when re-partitioning to output to a 
storage system like HDFS that doesn't handle large amounts of small files well. 
 I've had a few Spark applications I've written where I've had to fan-out the 
number of partitions to reduce stress on the cluster for heavy calculations, 
and then reduce that number back down to one partition per executor to reduce 
the number of output files to HDFS in order to keep the block-count down.

However, I can't think of a scenario where it would make sense to reduce the 
number of partitions to be less than the number of executors.  If someone needs 
to do this, I think it would be more appropriate to schedule a job to sweep-up 
after Spark and concatenate its output into a single file.)

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395405#comment-16395405
 ] 

Josiah Berkebile commented on SPARK-5997:
-

This functionality can be useful when re-partitioning to output to a storage 
system like HDFS that doesn't handle large amounts of small files well.  I've 
had a few Spark applications I've written where I've had to fan-out the number 
of partitions to reduce stress on the cluster for heavy calculations, and then 
reduce that number back down to one partition per executor to reduce the number 
of output files to HDFS in order to keep the block-count down.

However, I can't think of a scenario where it would make sense to reduce the 
number of partitions to be less than the number of executors.  If someone needs 
to do this, I think it would be more appropriate to schedule a job to sweep-up 
after Spark and concatenate its output into a single file.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org