[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462805#comment-17462805 ]

Nicholas Chammas commented on SPARK-5997:
-----------------------------------------

[~tenstriker] - I believe in your case you should be able to set {{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle the data, but it will still split your output across multiple files.

> Increase partition count without performing a shuffle
> -----------------------------------------------------
>
>                 Key: SPARK-5997
>                 URL: https://issues.apache.org/jira/browse/SPARK-5997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Andrew Ash
>            Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the
> user has the ability to choose whether or not to perform a shuffle. However,
> when increasing partition count there is no option of whether to perform a
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition
> enough that .toLocalIterator has significantly reduced memory pressure on
> the driver, as it loads a partition at a time into the driver.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
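A minimal sketch of the {{spark.sql.files.maxRecordsPerFile}} setting suggested above, as it would appear in spark-defaults.conf; the 1,000,000-record cap is an arbitrary example value, not a recommendation:

```
# spark-defaults.conf (or pass with --conf on spark-submit):
# cap each output file at roughly one million records; Spark splits the
# output at write time without performing an extra shuffle.
spark.sql.files.maxRecordsPerFile  1000000
```

The same value can also be set at runtime with {{spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)}} before writing.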
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332011#comment-17332011 ]

F. H. commented on SPARK-5997:
------------------------------

Is there any solution to this issue? I'd like to split my dataset by one key and then further split each partition such that I reach a certain number of partitions.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792964#comment-16792964 ]

nirav patel commented on SPARK-5997:
------------------------------------

Adding another possible use case for this ask: I am hitting an {{IllegalArgumentException: Size exceeds Integer.MAX_VALUE}} error when trying to write an unpartitioned DataFrame to Parquet. The error occurs because a shuffle block exceeds 2GB in size. The solution is to repartition the DataFrame (Dataset). I can do that, but I don't want to cause a shuffle when I increase the number of partitions with the repartition API.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395408#comment-16395408 ]

Josiah Berkebile commented on SPARK-5997:
-----------------------------------------

Maybe it's worth mentioning that I've had to do this for several NLP-related Spark jobs because of the amount of RAM consumed building out the parsing tree. Increasing the partition count in proportion to the total row count, so that each partition held roughly 'X' rows, both increased parallelism across the cluster and reduced pressure on RAM/heap utilization. In these sorts of scenarios, exact balance across the partitions isn't of critical importance, so performing a shuffle just to maintain balance is detrimental to overall job performance.
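The sizing rule described above (partition count proportional to total row count, so each partition holds roughly 'X' rows) can be sketched as a pure-Python helper; the function name and the example row counts below are hypothetical for illustration, not a Spark API:

```python
import math

def partitions_for(total_rows, rows_per_partition):
    """Partition count so each partition holds roughly rows_per_partition rows."""
    return max(1, math.ceil(total_rows / rows_per_partition))

# e.g. repartition a 10,500-row dataset at ~1,000 rows per partition:
# rdd.repartition(partitions_for(10500, 1000))  # -> 11 partitions
```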
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395405#comment-16395405 ]

Josiah Berkebile commented on SPARK-5997:
-----------------------------------------

This functionality can be useful when repartitioning to output to a storage system like HDFS that doesn't handle large numbers of small files well. In a few Spark applications I've written, I've had to fan out the number of partitions to reduce stress on the cluster during heavy calculations, and then reduce that number back down to one partition per executor to keep the output file count, and therefore the HDFS block count, down.

However, I can't think of a scenario where it would make sense to reduce the number of partitions below the number of executors. If someone needs to do this, I think it would be more appropriate to schedule a job to sweep up after Spark and concatenate its output into a single file.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696016#comment-15696016 ]

holdenk commented on SPARK-5997:
--------------------------------

That could work, although we'd probably want a different API, and we'd need to be clear that the result doesn't have a known partitioner.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259101#comment-15259101 ]

Robert Ormandi commented on SPARK-5997:
---------------------------------------

Would it solve the problem if the method simply split each partition into N new ones uniformly? That way we would have N x originalNumberOfPartitions partitions, each containing approximately originalNumberOfObjectsPerPartition / N objects. N could be a parameter of the method.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035156#comment-15035156 ]

Jaeboo Jung commented on SPARK-5997:
------------------------------------

This is a good idea to increase parallelism without a shuffle. But the important thing is that we cannot avoid losing balance of data size among the partitions. It may be hard to choose which partition should be divided.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603543#comment-14603543 ]

Perinkulam I Ganesh commented on SPARK-5997:
--------------------------------------------

Hi, I'm new to Spark, so I may be speaking out of ignorance, but wouldn't the following function accomplish the above? The approach is simple: if the number of partitions requested is higher than the current count, collapse the current partitions into a single partition with no shuffle. With only one partition, repartitioning with or without a shuffle behaves identically, so shuffle that single partition back out into the higher partition count.

{code}
def repartitionv2(numPartitions: Int, shuffle: Boolean)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  if (shuffle) {
    coalesce(numPartitions, shuffle)
  } else {
    val cnt = getPartitions.size
    if (numPartitions > cnt) {
      // Collapse to one partition without a shuffle, then split that single
      // partition back out to the requested count with a shuffle.
      val temp = coalesce(1, false)
      temp.coalesce(numPartitions, true)
    } else {
      coalesce(numPartitions, shuffle)
    }
  }
}
{code}

It seems to work ... thanks - P. I.
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603985#comment-14603985 ]

Sean Owen commented on SPARK-5997:
----------------------------------

Coalescing to 1 partition doesn't avoid the problem, since (unless you started with 1 executor) most partitions are remote anyway and have to be collected to 1 machine. This isn't a good thing.