[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r237225273

--- Diff: R/pkg/R/DataFrame.R ---
@@ -767,6 +767,14 @@ setMethod("repartition",
 #' using \code{spark.sql.shuffle.partitions} as number of partitions.}
 #'}
 #'
+#' At least one partition-by expression must be specified.

--- End diff --

@felixcheung Thanks, I did not know about this strict doc formatting rule in R.
@srowen Thanks for taking care of the fix!

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user JulienPeloton commented on the issue: https://github.com/apache/spark/pull/23025

Thanks all for the reviews!
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234512932

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
 At least one partition-by expression must be specified.
 When no explicit sort order is specified, "ascending nulls first" is assumed.
+Note that due to performance reasons this method uses sampling to estimate the ranges.

--- End diff --

Oh right, I missed it! Pushed.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234509708

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
 * When no explicit sort order is specified, "ascending nulls first" is assumed.
 * Note, the rows are not sorted in each partition of the resulting Dataset.
 *
+ *
+ * Note that due to performance reasons this method uses sampling to estimate the ranges.
+ * Hence, the output may not be consistent, since sampling can return different values.
+ * The sample size can be controlled by setting the value of the parameter
+ * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.

--- End diff --

@cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).
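For readers following the thread: the parameter discussed above can be changed at runtime before triggering the repartition. A minimal sketch, assuming a running `SparkSession` named `spark` and a DataFrame `df` with a `val` column (both are assumptions, not part of the PR):

```scala
import org.apache.spark.sql.functions.col

// Raise the number of rows sampled per partition when estimating range
// boundaries (the default is 100). A larger sample makes the boundaries
// more stable across runs, at the cost of a more expensive sampling pass.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "1000")

// Subsequent range repartitions pick up the new sample size.
val repartitioned = df.repartitionByRange(3, col("val"))
```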
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user JulienPeloton commented on the issue: https://github.com/apache/spark/pull/23025

@viirya OK all references to SPARK-26024 removed from the doc.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234099934

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
 * When no explicit sort order is specified, "ascending nulls first" is assumed.
 * Note, the rows are not sorted in each partition of the resulting Dataset.
 *
+ * [SPARK-26024] Note that due to performance reasons this method uses sampling to

--- End diff --

Thanks. Done.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234099956

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
 * When no explicit sort order is specified, "ascending nulls first" is assumed.
 * Note, the rows are not sorted in each partition of the resulting Dataset.
 *
+ * [SPARK-26024] Note that due to performance reasons this method uses sampling to
+ * estimate the ranges. Hence, the output may not be consistent, since sampling can return
+ * different values. The sample size can be controlled by setting the value of the parameter
+ * {{spark.sql.execution.rangeExchange.sampleSizePerPartition}}.

--- End diff --

Thanks. Done.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
GitHub user JulienPeloton opened a pull request: https://github.com/apache/spark/pull/23025

[SPARK-26024][SQL]: Update documentation for repartitionByRange

Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed that the number of elements in each partition after repartitioning with `df.repartitionByRange` can vary for the same setup:

```scala
import scala.util.Random
import org.apache.spark.sql.functions.col
import spark.implicits._

// Shuffle numbers from 0 to 1000, and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition it using 3 partitions,
// sum up the number of elements in each partition, and collect it.
// Repeat several times.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies between runs
```

This is expected: for performance reasons, this method uses sampling to estimate the ranges (with a default sample size of 100 per partition). Hence, the output may not be consistent, since sampling can return different values. However, the documentation did not mention this at all, which could lead to misunderstanding.

## What changes were proposed in this pull request?

Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.
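As a complement to the experiment above, one way to see the effect of the sample size is to rerun the same loop after raising `spark.sql.execution.rangeExchange.sampleSizePerPartition`. This is a sketch under the same assumptions as the original snippet (a `SparkSession` named `spark` and the DataFrame `df` already defined):

```scala
import org.apache.spark.sql.functions.col

// Raise the per-partition sample size from its default of 100.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "10000")

// Repeat the experiment: with a larger sample, the estimated range
// boundaries (and hence the per-partition element counts) should vary
// less between runs, though exact reproducibility is still not guaranteed.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
```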
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JulienPeloton/spark SPARK-26024

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23025.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23025

commit b47b6d0eb207021dbde38c35108fec0e39a64332
Author: Julien
Date: 2018-11-13T20:49:53Z

    Update documentation on repartitionByRange according to SPARK-26024 (Spark)

commit 5a50282959c065f9797ce075239c30edeece4fbe
Author: Julien
Date: 2018-11-13T20:50:38Z

    Update documentation on repartitionByRange according to SPARK-26024 (PySpark)