[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...
Github user vinodkc closed the pull request at:

    https://github.com/apache/spark/pull/20947

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...
Github user TRANTANKHOA commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20947#discussion_r178444354

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1593,7 +1596,9 @@ class Dataset[T] private[sql](
       def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
         val colNames: Seq[String] = col1 +: cols
         RelationalGroupedDataset(
    -      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
    +      toDF(),
    +      colNames.distinct.map(colName => resolve(colName)),
    --- End diff --

    Yes, this ticket is only about making this behavior change, so the real question is whether the team considers it the expected behavior. I personally find that it helps eliminate bugs in our ETL. I don't think anyone needs duplicated columns in their grouped data set intentionally.
[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20947#discussion_r178324135

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1593,7 +1596,9 @@ class Dataset[T] private[sql](
       def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
         val colNames: Seq[String] = col1 +: cols
         RelationalGroupedDataset(
    -      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
    +      toDF(),
    +      colNames.distinct.map(colName => resolve(colName)),
    --- End diff --

    This will cause a behavior change.
[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...
GitHub user vinodkc opened a pull request:

    https://github.com/apache/spark/pull/20947

    [SPARK-23705][SQL]Handle non-distinct columns in DataSet.groupBy

    ## What changes were proposed in this pull request?

    If the column names passed to Dataset.groupBy contain duplicates, drop the duplicates so that each column is grouped on only once.

    ## How was this patch tested?

    Added a unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vinodkc/spark br_FIX_SPARK-23705

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20947.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20947

commit eb93f0590f47227a16055f7eea6bd1e906dec3c9
Author: vinodkc
Date:   2018-03-30T15:53:49Z

    Handle non-distinct columns in groupBy
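To make the behavior change under discussion concrete, here is a minimal plain-Scala sketch with no Spark dependency. The object and method names (`GroupByDistinctSketch`, `groupingColumnsBefore`, `groupingColumnsAfter`) and the column names are hypothetical; they only model how the list of grouping column names is built before and after the one-line change in the diff, not the actual `Dataset.groupBy` implementation.

```scala
// Hypothetical stand-in for the column-name handling in the diff above.
// It models only the Seq[String] of grouping names, not Spark itself.
object GroupByDistinctSketch {

  // Before the patch: names are passed through unchanged,
  // so a repeated name is grouped on (and resolved) twice.
  def groupingColumnsBefore(col1: String, cols: String*): Seq[String] =
    col1 +: cols

  // After the patch: Seq.distinct drops repeated names,
  // keeping the first occurrence and preserving order.
  def groupingColumnsAfter(col1: String, cols: String*): Seq[String] =
    (col1 +: cols).distinct

  def main(args: Array[String]): Unit = {
    val before = groupingColumnsBefore("dept", "year", "dept")
    val after  = groupingColumnsAfter("dept", "year", "dept")
    println(before.mkString(", ")) // dept, year, dept
    println(after.mkString(", "))  // dept, year
  }
}
```

Callers that build the name list programmatically (as in the ETL case mentioned in the thread) would see one fewer grouping column after the change, which is why the reviewers treat it as a behavior change rather than a pure bug fix.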