[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-28 Thread JulienPeloton
Github user JulienPeloton commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r237225273
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -767,6 +767,14 @@ setMethod("repartition",
#'  using \code{spark.sql.shuffle.partitions} as number of partitions.}
 #'}
 #'
+#' At least one partition-by expression must be specified.
--- End diff --

@felixcheung Thanks, I did not know about this strict doc formatting rule in R.

@srowen Thanks for taking care of the fix!
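
For context, a minimal Scala sketch of the constraint this doc line describes (assuming a DataFrame `df` with a numeric column `val`, as in the PR description further below):

```scala
import org.apache.spark.sql.functions.col

// No partition-by expression: fails at runtime with
// "At least one partition-by expression must be specified."
df.repartitionByRange(3)

// One partition-by expression: OK
df.repartitionByRange(3, col("val"))
```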




[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

2018-11-19 Thread JulienPeloton
Github user JulienPeloton commented on the issue:

https://github.com/apache/spark/pull/23025
  
Thanks all for the reviews!





[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-18 Thread JulienPeloton
Github user JulienPeloton commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r234512932
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
 At least one partition-by expression must be specified.
When no explicit sort order is specified, "ascending nulls first" is assumed.

+Note that due to performance reasons this method uses sampling to estimate the ranges.
--- End diff --

Oh right, I missed it! Pushed.
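
For context, the default ordering mentioned in the diff above can be spelled out explicitly; a minimal Scala sketch (assuming a DataFrame `df` with a numeric column `val`):

```scala
import org.apache.spark.sql.functions.col

// A bare column is treated as ascending with nulls first...
df.repartitionByRange(3, col("val"))
// ...which is equivalent to the explicit form:
df.repartitionByRange(3, col("val").asc_nulls_first)

// An explicit sort order overrides the default:
df.repartitionByRange(3, col("val").desc)
```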





[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-18 Thread JulienPeloton
Github user JulienPeloton commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r234509708
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
* When no explicit sort order is specified, "ascending nulls first" is assumed.
* Note, the rows are not sorted in each partition of the resulting Dataset.
*
+   *
+   * Note that due to performance reasons this method uses sampling to estimate the ranges.
+   * Hence, the output may not be consistent, since sampling can return different values.
+   * The sample size can be controlled by setting the value of the parameter
+   * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
--- End diff --

@cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).





[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...

2018-11-15 Thread JulienPeloton
Github user JulienPeloton commented on the issue:

https://github.com/apache/spark/pull/23025
  
@viirya OK all references to SPARK-26024 removed from the doc.





[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-15 Thread JulienPeloton
Github user JulienPeloton commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r234099934
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
* When no explicit sort order is specified, "ascending nulls first" is assumed.
* Note, the rows are not sorted in each partition of the resulting Dataset.
*
+   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
--- End diff --

Thanks. Done.





[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-15 Thread JulienPeloton
Github user JulienPeloton commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r234099956
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
* When no explicit sort order is specified, "ascending nulls first" is assumed.
* Note, the rows are not sorted in each partition of the resulting Dataset.
*
+   * [SPARK-26024] Note that due to performance reasons this method uses sampling to
+   * estimate the ranges. Hence, the output may not be consistent, since sampling can return
+   * different values. The sample size can be controlled by setting the value of the parameter
+   * {{spark.sql.execution.rangeExchange.sampleSizePerPartition}}.
--- End diff --

Thanks. Done.






[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-13 Thread JulienPeloton
GitHub user JulienPeloton opened a pull request:

https://github.com/apache/spark/pull/23025

[SPARK-26024][SQL]: Update documentation for repartitionByRange

Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed the number of elements in each partition after repartitioning using `df.repartitionByRange` can vary for the same setup:

```scala
import scala.util.Random

import org.apache.spark.sql.functions.col
// Assumes a running SparkSession named `spark` (e.g. in spark-shell),
// needed for toDF and the Int encoder used by mapPartitions.
import spark.implicits._

// Shuffle numbers from 0 to 1000, and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition it into 3 partitions, count the number of elements in each
// partition, collect the counts, and repeat several times.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies between runs
```

This is expected: for performance reasons, this method uses sampling to estimate the ranges (with a default sample size of 100 per partition). Hence, the output may not be consistent, since sampling can return different values. But the documentation did not mention this at all, leading to misunderstanding.
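
As an illustration, here is a minimal sketch (reusing the `df` and `spark` from the snippet above) of how a larger sample size makes the estimated ranges, and hence the per-partition counts, more stable, at the cost of a more expensive sampling pass:

```scala
// Raise the number of rows sampled per partition when estimating the
// range boundaries (the default is 100).
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 1000)

// With a larger sample, repeated runs should produce more consistent
// per-partition counts, although exact determinism is still not guaranteed.
val counts = df.repartitionByRange(3, col("val"))
  .mapPartitions(part => Iterator(part.size))
  .collect()
println(counts.toList)
```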

## What changes were proposed in this pull request?

Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JulienPeloton/spark SPARK-26024

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23025


commit b47b6d0eb207021dbde38c35108fec0e39a64332
Author: Julien 
Date:   2018-11-13T20:49:53Z

Update documentation on repartitionByRange according to SPARK-26024 (Spark)

commit 5a50282959c065f9797ce075239c30edeece4fbe
Author: Julien 
Date:   2018-11-13T20:50:38Z

Update documentation on repartitionByRange according to SPARK-26024 (PySpark)



