[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...

2018-10-17 Thread vinodkc
Github user vinodkc closed the pull request at:

https://github.com/apache/spark/pull/20947


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...

2018-03-31 Thread TRANTANKHOA
Github user TRANTANKHOA commented on a diff in the pull request:

https://github.com/apache/spark/pull/20947#discussion_r178444354
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1593,7 +1596,9 @@ class Dataset[T] private[sql](
   def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
     val colNames: Seq[String] = col1 +: cols
     RelationalGroupedDataset(
-      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
+      toDF(),
+      colNames.distinct.map(colName => resolve(colName)),
--- End diff --

Yes, this ticket is only about making this behavior change, so the real 
question is whether the team considers it the expected behavior. I personally 
find that it helps eliminate bugs in our ETL. I don't think anyone 
intentionally needs duplicated columns in their grouped dataset. 
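
To make the behavior change under discussion concrete, here is a toy model in plain Scala (no Spark dependency). It assumes that a grouped result lists its grouping columns first and the aggregate column last, so a repeated name in the input shows up as a repeated column in the output. `GroupedOutputSketch`, `outputColumns`, and the `dedupe` flag are hypothetical names used only for illustration; they are not part of Spark's API.

```scala
// Toy model only: not Spark code. It mimics the column list of a grouped
// result under the assumption stated above.
object GroupedOutputSketch {
  // grouping:  column names passed to a groupBy-like call
  // aggregate: name of the single aggregate column appended at the end
  // dedupe:    false models the behavior before the patch, true after it
  def outputColumns(grouping: Seq[String], aggregate: String, dedupe: Boolean): Seq[String] = {
    val keys = if (dedupe) grouping.distinct else grouping
    keys :+ aggregate
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("dept", "dept")
    // Without de-duplication the grouping column appears twice.
    println(outputColumns(cols, "count", dedupe = false).mkString(", "))
    // With de-duplication it appears once.
    println(outputColumns(cols, "count", dedupe = true).mkString(", "))
  }
}
```

Any downstream code that relies on the position or number of output columns would see a different shape after the change, which is why the review flags it as a behavior change.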





[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...

2018-03-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20947#discussion_r178324135
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1593,7 +1596,9 @@ class Dataset[T] private[sql](
   def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
     val colNames: Seq[String] = col1 +: cols
     RelationalGroupedDataset(
-      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
+      toDF(),
+      colNames.distinct.map(colName => resolve(colName)),
--- End diff --

This will cause a behavior change. 





[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...

2018-03-30 Thread vinodkc
GitHub user vinodkc opened a pull request:

https://github.com/apache/spark/pull/20947

[SPARK-23705][SQL]Handle non-distinct columns in DataSet.groupBy

## What changes were proposed in this pull request?

If the input columns passed to Dataset.groupBy contain duplicate column 
names, remove the duplicates so that each column is grouped on only once.
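
As a minimal sketch of the de-duplication step, the snippet below applies `distinct` to the same `col1 +: cols` shape that `groupBy` receives. It uses only the Scala standard library; `DedupeSketch` and `groupingColumnNames` are hypothetical names for illustration and do not exist in Spark.

```scala
object DedupeSketch {
  // Mirrors the (col1: String, cols: String*) parameter shape of groupBy.
  // `distinct` keeps the first occurrence of each name and preserves order.
  def groupingColumnNames(col1: String, cols: String*): Seq[String] =
    (col1 +: cols).distinct

  def main(args: Array[String]): Unit = {
    // "dept" is passed twice; only the first occurrence survives.
    println(groupingColumnNames("dept", "region", "dept").mkString(", "))
    // The comparison is on exact strings, so names differing only in case
    // are both kept.
    println(groupingColumnNames("dept", "Dept").mkString(", "))
  }
}
```

Because the comparison is string-based, names that differ only in case are not merged, even though they may resolve to the same column when case-insensitive resolution is in effect.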

## How was this patch tested?
Added a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vinodkc/spark br_FIX_SPARK-23705

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20947.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20947


commit eb93f0590f47227a16055f7eea6bd1e906dec3c9
Author: vinodkc 
Date:   2018-03-30T15:53:49Z

Handle non-distinct columns in groupBy



