[GitHub] spark pull request #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropd...

2016-10-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15427


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropd...

2016-10-12 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15427#discussion_r83140093
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1878,17 +1878,25 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.map { colName =>
-      allColumns.find(col => resolver(col.name, colName)).getOrElse(
+    val groupCols = colNames.flatMap { colName =>
+      // It is possibly there are more than one columns with the same name,
+      // so we call filter instead of find.
+      val cols = allColumns.filter(col => resolver(col.name, colName))
+      if (cols.isEmpty) {
         throw new AnalysisException(
--- End diff --

My thought is:

When a user mistakenly passes a wrong column name to `Dataset.drop`, the
mistake is easy to spot.

But with `Dataset.dropDuplicates`, it may be harder to notice that the
duplicate rows are still there. So throwing an explicit exception looks more
appropriate to me.
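
The trade-off can be illustrated with a small plain-Scala model. This is only a sketch, not Spark code: `Row`, `dropDuplicatesLenient`, `dropDuplicatesStrict` and the exception type are hypothetical names chosen for illustration, and the lenient variant assumes that an unknown column name turns the call into a no-op.

```scala
object UnknownColumnSketch {
  // Hypothetical row model: column name -> value.
  type Row = Map[String, Any]

  // Keep the first row seen for each combination of key values.
  private def dedupe(rows: Seq[Row], keys: Seq[String]): Seq[Row] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Seq[Any], Row]
    rows.foreach { r => seen.getOrElseUpdate(keys.map(r), r) }
    seen.values.toSeq
  }

  // Lenient (assumed behaviour): an unknown column name makes the call a
  // no-op, so a typo silently leaves the duplicate rows in place.
  def dropDuplicatesLenient(rows: Seq[Row], columns: Seq[String], keys: Seq[String]): Seq[Row] =
    if (keys.forall(k => columns.contains(k))) dedupe(rows, keys) else rows

  // Strict: an unknown column name fails immediately with an explicit error.
  def dropDuplicatesStrict(rows: Seq[Row], columns: Seq[String], keys: Seq[String]): Seq[Row] = {
    val known = columns.mkString(", ")
    keys.find(k => !columns.contains(k)).foreach { k =>
      throw new IllegalArgumentException(s"Cannot resolve column name '$k' among ($known)")
    }
    dedupe(rows, keys)
  }
}
```

With a misspelled key, the lenient variant returns its input unchanged and gives no hint that nothing was deduplicated, while the strict variant reports the bad name at the call site.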







[GitHub] spark pull request #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropd...

2016-10-10 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/15427

[SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates

## What changes were proposed in this pull request?

This PR fixes two issues in `Dataset.dropDuplicates`:

1. Dataset.dropDuplicates should consider all columns with the same column name

`Dataset.dropDuplicates` currently looks up the given column name in the
output and takes only the first resolved attribute. When more than one column
has the same name, the other columns are treated as aggregation columns
instead of grouping columns.

2. Dataset.dropDuplicates should not change the output of child plan

`Dataset.dropDuplicates` currently creates a new `Alias` with a new exprId.
However, this causes a problem when we select columns through the original
Dataset, as follows:

    val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
    // ds("_2") will cause analysis exception
    ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])
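
Both issues can be modelled without Spark. The following is a minimal sketch under stated assumptions: `Attr`, the case-insensitive `resolver`, and all function names are hypothetical stand-ins rather than Spark classes, and attribute identity is reduced to a name plus a numeric exprId. How the actual patch preserves the child output is not shown in this message and is not modelled here.

```scala
// Hypothetical stand-in for a resolved attribute: a name plus a unique id.
final case class Attr(name: String, exprId: Long)

object DropDuplicatesSketch {
  // Stand-in for the analyzer's resolver (assumed case-insensitive here).
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Issue 1, old lookup: find returns only the first attribute whose name matches.
  def groupColsFirstMatch(output: Seq[Attr], colNames: Seq[String]): Seq[Attr] =
    colNames.flatMap(n => output.find(a => resolver(a.name, n)))

  // Issue 1, new lookup: filter keeps every attribute whose name matches.
  def groupColsAllMatches(output: Seq[Attr], colNames: Seq[String]): Seq[Attr] =
    colNames.flatMap(n => output.filter(a => resolver(a.name, n)))

  // Issue 2: re-aliasing the non-grouping columns gives them fresh exprIds,
  // so the output no longer contains the child's original attributes.
  def outputWithNewAliases(child: Seq[Attr], groupCols: Seq[Attr]): Seq[Attr] = {
    var nextId = child.map(_.exprId).foldLeft(0L)((x, y) => math.max(x, y))
    child.map { a =>
      if (groupCols.contains(a)) a
      else { nextId += 1; a.copy(exprId = nextId) }
    }
  }

  // A reference taken from the original Dataset resolves only if the exact
  // attribute (same name and same exprId) is still present in the output.
  def canResolve(output: Seq[Attr], ref: Attr): Boolean = output.contains(ref)
}
```

In this model, an output with two columns named `a` yields one grouping column under the old lookup and two under the new one. After re-aliasing, a reference to the original `_2` attribute no longer resolves, which mirrors the analysis exception in the example above; if the child output is left unchanged, the same reference resolves.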


Because both issues relate to `Dataset.dropDuplicates` and the code changes
are small, they are submitted together as one PR.

## How was this patch tested?

Jenkins tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-dropduplicates

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15427


commit dd6405c003ea082b1c614f2efed4d1bcb2d6f5b9
Author: Liang-Chi Hsieh 
Date:   2016-10-11T06:08:44Z

Fix Dataset.dropduplicates.



