[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-08-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r208479881
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -371,25 +426,14 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * (Java-specific) Pivots a column of the current `DataFrame` and performs the specified
-   * aggregation.
-   *
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
+   * aggregation. This is an overloaded version of the `pivot` method with `pivotColumn` of
+   * the `String` type.
    *
-   * {{{
-   *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Arrays.asList("dotNET", "Java")).sum("earnings");
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings");
-   * }}}
-   *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = {
--- End diff --

If there's a plan for auditing it in 3.0.0, I am okay with going ahead with
`Column`, but the thing is, we should deprecate the old ones first.

For the current status, I think the problem here is that this is an overloaded
version of `pivot`, so the overload alone wouldn't necessarily make the
difference apparent.

I used `pivot` heavily at my previous company, and I am pretty sure
`pivot(String, actual values)` has been the common pattern and usage so far.
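
For concreteness, the pattern in question (a sketch; it assumes the usual docs
DataFrame `df` with `year`, `course` and `earnings` columns, and
`spark.implicits._` in scope):

```scala
// Long-standing usage: a String column name plus the explicit values.
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

// The overload added by this PR takes a Column instead:
df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
```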


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-08-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r208478196
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -371,25 +426,14 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * (Java-specific) Pivots a column of the current `DataFrame` and performs the specified
-   * aggregation.
-   *
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
+   * aggregation. This is an overloaded version of the `pivot` method with `pivotColumn` of
+   * the `String` type.
    *
-   * {{{
-   *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Arrays.asList("dotNET", "Java")).sum("earnings");
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings");
-   * }}}
-   *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = {
--- End diff --

I think it's a bad idea to use `Any` in the API. The existing ones we can't
remove, but we should not add new ones that use `Any`.

In Spark 3.0 we should audit all the APIs in `functions.scala` and make them
consistent (e.g., only use `Column` in the API).
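
For illustration, a `Column`-only variant might look like this. This is purely
hypothetical and is not what this PR adds (its `values` parameter stays
`Seq[Any]` / `java.util.List[Any]`):

```scala
import org.apache.spark.sql.functions.lit

// Hypothetical Column-only shape of the API, sketched for illustration:
// def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset

// Callers would then pass typed literals explicitly instead of raw Any values:
df.groupBy($"year").pivot($"course", Seq(lit("dotNET"), lit("Java"))).sum($"earnings")
```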


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21699


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-08-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r207088444
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * Pivots a column of the current `DataFrame` and performs the specified aggregation.
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
+   * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type.
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings")
+   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+    import org.apache.spark.sql.functions.struct
     groupType match {
       case RelationalGroupedDataset.GroupByType =>
+        val pivotValues = values.map {
--- End diff --

I hope the last commit is reverted so that we can proceed with it orthogonally,
if @maryannxue is happy with that too.


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-08-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r207088374
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---
@@ -246,4 +267,77 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext {
       checkAnswer(df.select($"a".cast(StringType)), Row(tsWithZone))
     }
   }
+
+  test("SPARK-24722: pivoting nested columns") {
--- End diff --

I usually leave a JIRA number on one specific regression test when it's a bug
fix, since that explicitly points out which case was broken... not a big deal,
though.
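
For reference, the convention looks like this in a suite (a sketch; the test
body is elided):

```scala
// Regression test titled with the JIRA ticket it covers, so a failure
// points directly at the originally reported case.
test("SPARK-24722: pivoting nested columns") {
  // ... reproduce the reported scenario and verify it with checkAnswer(...) ...
}
```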


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-08-01 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r206802439
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * Pivots a column of the current `DataFrame` and performs the specified aggregation.
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
+   * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type.
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings")
+   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+    import org.apache.spark.sql.functions.struct
     groupType match {
       case RelationalGroupedDataset.GroupByType =>
+        val pivotValues = values.map {
--- End diff --

@HyukjinKwon Should I revert the last commit and propose it as a separate PR?
I think it makes sense to discuss possible API alternatives in the JIRA ticket.


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r206732588
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * Pivots a column of the current `DataFrame` and performs the specified aggregation.
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
+   * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type.
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings")
+   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+    import org.apache.spark.sql.functions.struct
     groupType match {
       case RelationalGroupedDataset.GroupByType =>
+        val pivotValues = values.map {
--- End diff --

This PR just proposes an overloaded version, `pivot(Column)`, of
`pivot(String)`. It's not necessary to fix other things together. Also, that
change would need another review iteration, I guess; for instance, do array,
map, and nested struct values work, etc.? Let's take this change out for now.
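
For context, one plausible reading of the truncated `values.map` in the quoted
diff, given the `struct` import it adds. This is a guess at the shape, not the
PR's actual code:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, struct}

// Possibly: lift each plain pivot value into a struct literal so it can be
// compared against a struct-typed pivot column. Whether arrays, maps and
// deeper nesting are covered is exactly the open review question.
val values: Seq[Any] = Seq("dotNET", "Java")
val pivotValues: Seq[Column] = values.map {
  case c: Column => c        // already a Column: pass it through
  case v => struct(lit(v))   // plain value: wrap it in a one-field struct
}
```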


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r206732345
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -339,29 +400,30 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * Pivots a column of the current `DataFrame` and performs the specified aggregation.
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
+   * This is an overloaded version of the `pivot` method with `pivotColumn` of the `String` type.
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings")
+   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+    import org.apache.spark.sql.functions.struct
     groupType match {
       case RelationalGroupedDataset.GroupByType =>
+        val pivotValues = values.map {
--- End diff --

Hm? Wait, @maryannxue, I think we shouldn't do this, at least not here.


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r205669387
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -340,29 +399,23 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * Pivots a column of the current `DataFrame` and performs the specified aggregation.
-   * There are two versions of pivot function: one that requires the caller to specify the list
--- End diff --

Shall we note this in the `Column` API too, or note that this is an overloaded
version of the `String` one?


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21699#discussion_r199742580
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -340,36 +340,52 @@ class RelationalGroupedDataset protected[sql](
 
   /**
    * Pivots a column of the current `DataFrame` and performs the specified aggregation.
-   * There are two versions of pivot function: one that requires the caller to specify the list
-   * of distinct values to pivot on, and one that does not. The latter is more concise but less
-   * efficient, because Spark needs to first compute the list of distinct values internally.
    *
    * {{{
    *   // Compute the sum of earnings for each year by course with each course as a separate column
-   *   df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
-   *
-   *   // Or without specifying column values (less efficient)
-   *   df.groupBy("year").pivot("course").sum("earnings")
+   *   df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
    * }}}
    *
-   * @param pivotColumn Name of the column to pivot.
+   * @param pivotColumn the column to pivot.
    * @param values List of values that will be translated to columns in the output DataFrame.
-   * @since 1.6.0
+   * @since 2.4.0
    */
-  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
--- End diff --

To make the diff smaller, can you move this under the signature `def
pivot(pivotColumn: String, values: Seq[Any])`?


---




[GitHub] spark pull request #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-02 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/21699

[SPARK-24722][SQL] pivot() with Column type argument

## What changes were proposed in this pull request?

In this PR, I propose a column-based API for the `pivot()` function. It allows
using nested columns as the pivot column, and it makes `pivot()` consistent
with how `groupBy()` works.
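
A short usage sketch of the proposed API (my own example, shaped like the
`trainingSales` data in the new tests):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, sum}

// A minimal, self-contained sketch; the local SparkSession is just for illustration.
val spark = SparkSession.builder().master("local[*]").appName("pivot-column-example").getOrCreate()
import spark.implicits._

val trainingSales = Seq(
  ("Experts", 2012, "dotNET", 10000),
  ("Experts", 2012, "Java",   20000),
  ("Dummies", 2013, "dotNET", 48000)
).toDF("training", "year", "course", "earnings")
  // Nest the fields under a struct so there is a nested column to pivot on.
  .select($"training", struct($"year", $"course", $"earnings").as("sales"))

// Pivot on a nested column, consistent with how groupBy() accepts Columns.
trainingSales
  .groupBy($"sales.year")
  .pivot($"sales.course", Seq("dotNET", "Java"))
  .agg(sum($"sales.earnings"))
  .show()
```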

## How was this patch tested?

I added new tests to `DataFramePivotSuite` and updated PySpark examples for 
the `pivot()` function.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 pivot-column

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21699.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21699


commit 889e9223510c821f359ee3ce5bec6ce2f746a027
Author: Maxim Gekk 
Date:   2018-07-02T17:09:19Z

Adding pivot() which takes Column as its argument

commit f736ea2bca27fee37281bcadb333e7b6bbcd6124
Author: Maxim Gekk 
Date:   2018-07-02T17:21:07Z

Tests for new function

commit 5e6822650f4781c343d477589bf252c37b8453c4
Author: Maxim Gekk 
Date:   2018-07-02T18:24:18Z

the since tag is updated

commit c82c3979065aba48536a743ebf3384f3c95b570c
Author: Maxim Gekk 
Date:   2018-07-02T19:05:48Z

Test for nested columns

commit 7d0d2261cef4c66226cd59635603391faabf0046
Author: Maxim Gekk 
Date:   2018-07-02T20:38:02Z

Python test for nested columns

commit 0fdd11ff26b4f4ca3b79bdd116aaf1c558643698
Author: Maxim Gekk 
Date:   2018-07-02T21:02:58Z

Adding ticket number to test's title




---
