[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2021-12-20 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462805#comment-17462805
 ] 

Nicholas Chammas commented on SPARK-5997:
-

[~tenstriker] - I believe in your case you should be able to set 
{{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle 
the data but it will still split up your output across multiple files.
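For illustration, a minimal PySpark sketch of that setting (the session, dataset, and output path here are hypothetical stand-ins, not from this thread):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap each output file at roughly one million records. Spark splits the
# output at write time, per task, without shuffling the data first.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

df = spark.range(5_000_000)            # stand-in for the real dataset
df.write.parquet("/tmp/split-output")  # multiple files per large partition
```

Since this is a write-time setting, it caps file size without changing the number of in-memory partitions.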

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2021-04-26 Thread F. H. (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332011#comment-17332011
 ] 

F. H. commented on SPARK-5997:
--

Is there any solution to this issue?

I'd like to split my dataset by one key and then further split each partition 
such that I reach a certain number of partitions.




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2019-03-14 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792964#comment-16792964
 ] 

nirav patel commented on SPARK-5997:


Adding another possible use case for this ask: I am hitting an 
IllegalArgumentException: Size exceeds Integer.MAX_VALUE error when trying to 
write an unpartitioned DataFrame to Parquet. The error occurs because a shuffle 
block exceeds 2 GB in size. The solution is to repartition the DataFrame 
(Dataset). I can do that, but I don't want to cause a shuffle when I increase 
the number of partitions with the repartition API.




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395408#comment-16395408
 ] 

Josiah Berkebile commented on SPARK-5997:
-

Maybe it's worth mentioning that I've had to do this for several NLP-related 
Spark jobs because of the amount of RAM it takes to build out the parsing 
tree. Increasing the partition count in proportion to the total row count, so 
that each partition held roughly 'X' rows, both increased parallelism 
across the cluster and reduced pressure on RAM/heap 
utilization. In these sorts of scenarios, exact balance across the partitions 
isn't critical, so performing a shuffle just to maintain balance 
is detrimental to overall job performance.




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395405#comment-16395405
 ] 

Josiah Berkebile commented on SPARK-5997:
-

This functionality can be useful when repartitioning to output to a storage 
system like HDFS that doesn't handle large numbers of small files well. I've 
written a few Spark applications where I had to fan out the number 
of partitions to reduce stress on the cluster during heavy calculations, and then 
reduce that number back down to one partition per executor to limit the number 
of output files written to HDFS and keep the block count down.

However, I can't think of a scenario where it would make sense to reduce the 
number of partitions below the number of executors. If someone needs 
to do this, I think it would be more appropriate to schedule a job that sweeps up 
after Spark and concatenates its output into a single file.




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696016#comment-15696016
 ] 

holdenk commented on SPARK-5997:


That could work, although we'd probably want a different API and we'd need to 
be clear that the result doesn't have a known partitioner.




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2016-04-26 Thread Robert Ormandi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259101#comment-15259101
 ] 

Robert Ormandi commented on SPARK-5997:
---

Would it solve the problem if the method simply split each partition into N new 
ones uniformly? That way we would end up with N x originalNumberOfPartitions 
partitions, each containing approximately originalNumberOfObjectsPerPartition / N 
objects. N could be a parameter of the method.
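As a rough illustration of the proposed split, here is a plain-Python sketch (lists stand in for partitions; the function name is made up):

```python
def split_partitions(partitions, n):
    """Split each partition into n roughly equal sub-partitions,
    preserving element order and moving no element between the
    original partitions. If n exceeds a partition's size, some
    of its sub-partitions come out empty."""
    result = []
    for part in partitions:
        base, rem = divmod(len(part), n)
        start = 0
        for i in range(n):
            # The first `rem` sub-partitions absorb the remainder.
            end = start + base + (1 if i < rem else 0)
            result.append(part[start:end])
            start = end
    return result

parts = [list(range(10)), list(range(10, 17))]
new_parts = split_partitions(parts, 3)
# 2 partitions become 6; sizes per new partition: [4, 3, 3, 3, 2, 2]
```

No element moves between the original partitions, which is what would make a shuffle unnecessary; the flip side is that the sub-partitions are only as balanced as the originals were.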




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2015-12-01 Thread Jaeboo Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035156#comment-15035156
 ] 

Jaeboo Jung commented on SPARK-5997:


This is a good idea for increasing parallelism without a shuffle. But the 
important caveat is that we cannot avoid imbalance in data size among the 
partitions, and it may be hard to choose which partitions should be divided.




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2015-06-26 Thread Perinkulam I Ganesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603543#comment-14603543
 ] 

Perinkulam I Ganesh commented on SPARK-5997:


Hi,

I'm new to Spark, so I may be speaking out of ignorance, but wouldn't the 
following function accomplish the above?

The approach is simple: if the requested number of partitions is higher than 
the current count, collapse the current partitions into a single partition with 
no shuffle. With only one partition, repartitioning behaves identically with or 
without a shuffle.

So take that single partition and, with a shuffle, split it into the higher 
number of partitions.

  def repartitionv2(numPartitions: Int, shuffle: Boolean)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      coalesce(numPartitions, shuffle = true)
    } else {
      val cnt = getPartitions.length
      if (numPartitions > cnt) {
        // Collapse everything into one partition without a shuffle, then
        // fan that single partition back out to the requested count.
        val single = coalesce(1, shuffle = false)
        single.coalesce(numPartitions, shuffle = true)
      } else {
        // Decreasing (or keeping) the partition count needs no special case.
        coalesce(numPartitions, shuffle = false)
      }
    }
  }

It seems to work ...

thanks

- P. I. 




[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2015-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603985#comment-14603985
 ] 

Sean Owen commented on SPARK-5997:
--

Coalescing to 1 partition doesn't avoid the problem: unless you started 
with 1 executor, most partitions are remote and have to be collected onto 
1 machine anyway. This isn't a good approach.
