[GitHub] spark issue #23108: [Spark-25993][SQL][TEST]Add test cases for CREATE EXTERN...

2018-12-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/23108
  
retest this please


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-12-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22141
  
@maropu Very sorry. I haven't had the time to come back to it. I have some stuff on my plate and will get to this after I am done. Thanks!!


---




[GitHub] spark issue #23108: [Spark-25993][SQL][TEST]Add test cases for CREATE EXTERN...

2018-12-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/23108
  
retest this please


---




[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...

2018-12-04 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/23211
  
@wangyum Thanks. Can you please tell me how you generated this? Also, is it possible to get the runtimes of these queries to see whether there are any regressions?


---




[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...

2018-12-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22899
  
@gatorsmile Thanks a lot. I completely agree that we should try and combine 
these two. I will continue to think about it :-)


---




[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...

2018-12-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/23211
  
retest this please


---




[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...

2018-12-03 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/23211

[SPARK-19712][SQL] Move PullupCorrelatedPredicates and 
RewritePredicateSubquery after OptimizeSubqueries

Currently, predicate subqueries (IN/EXISTS) are converted to joins at the end of the
optimizer, in RewritePredicateSubquery. This change moves the rewrite close to the
beginning of the optimizer. The original idea was to keep the subquery expressions in
Filter form so that we can push them down as deep as possible. One disadvantage is that,
after the subqueries are rewritten in join form, they are not subjected to further
optimizations. In this change, we convert the subqueries to join form early in the
rewrite phase and then add logic to push the left-semi and left-anti joins down, just as
we do for normal filter operations. I can think of the following advantages:

1. We will produce consistent optimized plans for subqueries written using the SQL
dialect and the DataFrame API.
2. It will hopefully make the next phase of de-correlation easier as we open up more
cases of de-correlation. In that case, it would be beneficial to expose the rewritten
queries to all the other optimization rules.
3. We can now hopefully get rid of the PullupCorrelatedPredicates rule and combine it
with RewritePredicateSubquery. I haven't tried it yet and will take it on in a follow-up.

(P.S. Thanks to Natt for his original work in
[here](https://github.com/apache/spark/pull/17520); I have based this PR on his work.)
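
As a rough, hypothetical illustration of the rewrite being discussed (not taken from this patch; assumes a spark-shell session so that `spark` is in scope):

```
// Register two tiny relations.
spark.range(10).createOrReplaceTempView("t1")
spark.range(5).createOrReplaceTempView("t2")

// An EXISTS predicate subquery. RewritePredicateSubquery turns the Filter into a
// left-semi join; moving that rewrite earlier lets the remaining optimizer rules
// see and further optimize the join form.
spark.sql(
  """SELECT * FROM t1
    |WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.id = t2.id)""".stripMargin).explain(true)
```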

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-19712-NEW

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23211.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23211


commit f4bb126472eb5a808a3ae94bcfb59e0674e01217
Author: Dilip Biswal 
Date:   2018-12-03T22:06:24Z

[SPARK-19712] Move PullupCorrelatedPredicates and RewritePredicateSubquery 
after OptimizeSubqueries




---




[GitHub] spark issue #23170: [SPARK-24423][FOLLOW-UP][SQL] Fix error example

2018-11-30 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/23170
  
Looks good to me. Thanks for fixing this @wangyum. I guess I tried to trim down the example to show the mutual exclusivity of the two parameters in question.


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-11-12 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r232571944
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala
 ---
@@ -57,3 +57,34 @@ case class Max(child: Expression) extends 
DeclarativeAggregate {
 
   override lazy val evaluateExpression: AttributeReference = max
 }
+
+abstract class AnyAggBase(arg: Expression)
+  extends UnevaluableAggrgate with ImplicitCastInputTypes {
+
+  override def children: Seq[Expression] = arg :: Nil
+
+  override def dataType: DataType = BooleanType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BooleanType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+arg.dataType match {
+  case dt if dt != BooleanType =>
+TypeCheckResult.TypeCheckFailure(s"Input to function '$prettyName' 
should have been " +
+  s"${BooleanType.simpleString}, but it's 
[${arg.dataType.catalogString}].")
+  case _ => TypeCheckResult.TypeCheckSuccess
+}
+  }
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is 
true.")
--- End diff --

@HyukjinKwon Sure. I will open a PR shortly.


---




[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...

2018-11-11 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22899
  
cc @gatorsmile 


---




[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...

2018-11-10 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22899
  
retest this please


---




[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22788#discussion_r229797016
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out 
---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
--- End diff --

@cloud-fan Makes sense. Thanks for the clarification.


---




[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22899
  
retest this please


---




[GitHub] spark issue #22900: [SPARK-25618][SQL][TEST] Reduce time taken to execute Ka...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22900
  
Oh...thank you very much @dongjoon-hyun 


---




[GitHub] spark pull request #22900: [SPARK-25618][SQL][TEST] Reduce time taken to exe...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22900#discussion_r229625779
  
--- Diff: 
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala
 ---
@@ -221,7 +221,7 @@ class KafkaSourceStressForDontFailOnDataLossSuite 
extends StreamTest with KafkaM
   .as[(String, String)]
 val query = startStream(kafka.map(kv => kv._2.toInt))
 
-val testTime = 1.minutes
+val testTime = 20.seconds
--- End diff --

@dongjoon-hyun Actually, I did count the number of times we went inside each case block. With this change, we do a little less than we used to before, but if I remember correctly, we still perform enough Kafka operations. I will get the count for each case block tomorrow for your perusal.


---




[GitHub] spark issue #22900: [SPARK-25618][SQL][TEST] Reduce time taken to execute Ka...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22900
  
cc @srowen 


---




[GitHub] spark issue #22900: [SPARK-25618][SQL][TEST] Reduce time taken to execute Ka...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22900
  
retest this please


---




[GitHub] spark issue #22895: [SPARK-25886][SQL][Minor] Improve error message of `Fail...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22895
  
retest this please


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/21860
  
retest this please


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22847
  
retest this please


---




[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22899
  
retest this please


---




[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22788#discussion_r229578738
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out 
---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
--- End diff --

@cloud-fan I had a question about the 3rd example from huaxin; I wanted to confirm again whether the output looks okay to you.

**from huaxin**
```
$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
making `sql` the same as `name`:
```
a.b
```
Is it okay to drop the backticks around the first identifier?


---




[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22788#discussion_r229577806
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out 
---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
--- End diff --

@huaxingao Were you planning to make a change so that `name` is the same as `sql`? Currently they are different.


---




[GitHub] spark pull request #22900: [SPARK-25618][SQL][TEST] Reduce time taken to exe...

2018-10-31 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22900

[SPARK-25618][SQL][TEST] Reduce time taken to execute 
KafkaContinuousSourceStressForDontFailOnDataLossSuite

## What changes were proposed in this pull request?
In this test, I have reduced the test time from 1 minute to 20 seconds while reducing the sleep time from 1 second to 100 milliseconds.

With this change, I was able to run the test in 20+ seconds consistently on my laptop. I would like to see whether it passes consistently in Jenkins.
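
In concrete terms, the timing change amounts to the following (a sketch; only `testTime` appears in the suite's diff, the sleep variable name here is illustrative):

```
import scala.concurrent.duration._

val testTime  = 20.seconds   // was 1.minutes: overall length of the stress run
val pollSleep = 100.millis   // was 1.second: per-iteration sleep (illustrative name)
```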

## How was this patch tested?
It is a test-only fix.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25618

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22900.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22900


commit 5a6907cd1dc0872d68214bc42f1b2898f9e48009
Author: Dilip Biswal 
Date:   2018-10-17T05:08:19Z

[SPARK-25618] Reduce time taken to execute 
KafkaContinuousSourceStressForDontFailOnDataLossSuite




---




[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...

2018-10-31 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22899
  
cc @gatorsmile 


---




[GitHub] spark pull request #22899: [SPARK-25573] Combine resolveExpression and resol...

2018-10-31 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22899

[SPARK-25573] Combine resolveExpression and resolve in the Analyzer

## What changes were proposed in this pull request?
Currently, the Analyzer has two methods, `resolve` and `resolveExpression`, that are called from different code paths to resolve attributes, column ordinals, and extract-value expressions. In this PR, we combine the two into one method to make sure there is only a single method tasked with resolving attributes.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25573-final

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22899.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22899


commit ff375690c7a884c52fa4683116570a964e46a96d
Author: Dilip Biswal 
Date:   2018-10-31T05:36:30Z

[SPARK-25573] Combine resolveExpression and resolve in the Analyzer




---




[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...

2018-10-30 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/17520
  
@gatorsmile Sure Sean.


---




[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-30 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229362181
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile Sure Sean.. Let me give it a try.


---




[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229181821
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog)
 // "Extract PythonUDF From JoinCondition".
 Batch("Check Cartesian Products", Once,
   CheckCartesianProducts) :+
-Batch("RewriteSubquery", Once,
+Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile Do you think it's a good time to revisit Natt's PR to convert subquery expressions to joins early in the optimization process? Perhaps then we can take advantage of all the subsequent rules firing after the subquery rewrite?


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229063489
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * View
   * If column aliases are not specified in view definition queries, both 
Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be 
able to read views created
-by Hive, users should explicitly specify column aliases in view 
definition queries.
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

@dongjoon-hyun Oh, thanks. Is it because it requires an explicit correlation? Sorry, I don't have a 1.2.2 environment to try it out.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229062732
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

@dongjoon-hyun I was thinking that calling upper on an int column is probably not very intuitive :-)
What do you think about adding a string literal to the projection?

```
SELECT c + 1, upper(d) FROM (SELECT 1 AS c, 'test' AS d) t
```


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229051205
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * View
   * If column aliases are not specified in view definition queries, both 
Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be 
able to read views created
-by Hive, users should explicitly specify column aliases in view 
definition queries.
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

nit: We could perhaps simplify the query to:
```
CREATE VIEW v1 AS (SELECT c1 + 1, upper(c2) FROM t1);
```
What do you think?


---




[GitHub] spark issue #22856: [SPARK-25856][SQL][MINOR] Remove AverageLike and CountLi...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22856
  
Thank you @srowen @dongjoon-hyun @mgaido91 @HyukjinKwon 


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-28 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r228766091
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries.
--- End diff --

@seancxmao Thanks for adding the doc. Would a small example here help illustrate this better?
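
For instance, a minimal, hypothetical sketch of the workaround this paragraph describes, i.e. explicit column aliases in the view definition (assumes a Hive-enabled spark-shell session so that `spark` is in scope; table and view names are illustrative):

```
spark.sql("CREATE TABLE t1 (c1 INT, c2 STRING)")

// With every output column explicitly aliased, Spark and Hive agree on the view's
// column names, so Spark can read the view regardless of which engine created it.
spark.sql("CREATE VIEW v1_aliased AS SELECT c1 + 1 AS c1_plus_1, upper(c2) AS upper_c2 FROM t1")
spark.table("v1_aliased").printSchema()
```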


---




[GitHub] spark pull request #22797: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-27 Thread dilipbiswal
Github user dilipbiswal closed the pull request at:

https://github.com/apache/spark/pull/22797


---




[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22809
  
Thanks a lot @cloud-fan @mgaido91 


---




[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...

2018-10-27 Thread dilipbiswal
Github user dilipbiswal closed the pull request at:

https://github.com/apache/spark/pull/22047


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228725645
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala
 ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+abstract class UnevaluableBooleanAggBase(arg: Expression)
--- End diff --

@cloud-fan @mgaido91 Thank you. I have added a TODO for now.


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228666789
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala
 ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+abstract class UnevaluableBooleanAggBase(arg: Expression)
--- End diff --

@mgaido91 I re-read your comments and here is what I think you are implying:
   1. In UnevaluableAggregate, define a method, say `replacedExpression`.
   2. The sub-class overrides it and returns the actual rewritten expression.
   3. In the optimizer, match on UnevaluableAggregate and call `replacedExpression`.

Did I understand it correctly? If so, I actually wouldn't prefer that approach. The reason is that the replaced expression is hidden from the analyzer, so it may not be safe. The way the ReplaceExpression framework is nicely designed, the analyzer resolves the rewritten expression normally (as it is the child expression). That is why I have opted for specific/targeted rewrites.

If Wenchen and you think otherwise, then I can change it.


---




[GitHub] spark pull request #22856: [SPARK-25856][SQL][MINOR] Remove AverageLike and ...

2018-10-26 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22856

[SPARK-25856][SQL][MINOR] Remove AverageLike and CountLike classes

## What changes were proposed in this pull request?
These two classes were added for the regr_ expression support (SPARK-23907). Those expressions have since been removed, so we can remove these base classes as well and inline the logic in the concrete classes.
## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark average_cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22856.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22856


commit 6becdd55e173bca40943e2aafda6a49cf5eb9934
Author: Dilip Biswal 
Date:   2018-10-26T20:28:20Z

[SPARK-25856] Remove AverageLike and CountLike classes




---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228644504
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala
 ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+abstract class UnevaluableBooleanAggBase(arg: Expression)
--- End diff --

@mgaido91 I tried it [here](https://github.com/apache/spark/compare/master...dilipbiswal:SPARK-19851-rewrite?expand=1). I had trouble getting it to work for window expressions. That is why @cloud-fan suggested trying the current approach in this [comment](https://github.com/apache/spark/pull/22797#issuecomment-432106729).


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228638310
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala
 ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+abstract class UnevaluableBooleanAggBase(arg: Expression)
--- End diff --

@mgaido91 Actually, I tried a few different ideas. They are in the comments of the original PR [link](https://github.com/apache/spark/pull/22047). I had prepared two branches with the two approaches; see this [comment](https://github.com/apache/spark/pull/22047#discussion_r225700828). Could you please take a look?


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228636110
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -282,6 +283,31 @@ trait RuntimeReplaceable extends UnaryExpression with 
Unevaluable {
   override lazy val canonicalized: Expression = child.canonicalized
 }
 
+/**
+ * An aggregate expression that gets rewritten (currently by the 
optimizer) into a
+ * different aggregate expression for evaluation. This is mainly used to 
provide compatibility
+ * with other databases. For example, we use this to support every, 
any/some aggregates by rewriting
+ * them with Min and Max respectively.
+ */
+trait UnevaluableAggregate extends DeclarativeAggregate {
+
+  override def nullable: Boolean = true
--- End diff --

@mgaido91 I think for aggregates it's different. Please see `Max` and `Min`; they all define it to be nullable. I think they work on groups of rows and can return null on empty input.
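
A quick sketch of that behaviour (assumes a spark-shell session; not part of this PR): `min`/`max` over an empty group evaluate to NULL, which is why these aggregates are nullable.

```
import org.apache.spark.sql.functions.{max, min}

// No rows survive the filter, so the aggregation runs over empty input and
// both aggregates evaluate to NULL.
spark.range(10).where("id < 0").agg(min("id"), max("id")).show()
```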


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228634827
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -282,6 +283,31 @@ trait RuntimeReplaceable extends UnaryExpression with 
Unevaluable {
   override lazy val canonicalized: Expression = child.canonicalized
 }
 
+/**
+ * An aggregate expression that gets rewritten (currently by the 
optimizer) into a
+ * different aggregate expression for evaluation. This is mainly used to 
provide compatibility
+ * with other databases. For example, we use this to support every, 
any/some aggregates by rewriting
+ * them with Min and Max respectively.
+ */
+trait UnevaluableAggregate extends DeclarativeAggregate {
+
+  override def nullable: Boolean = true
--- End diff --

@mgaido91 Most of the aggregates are nullable, no? Did you have a suggestion here?


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228634159
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala
 ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+abstract class UnevaluableBooleanAggBase(arg: Expression)
--- End diff --

@mgaido91 RuntimeReplaceable works for scalar expressions but not for aggregate expressions.


---




[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-26 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22809
  
@cloud-fan @mgaido91 I have incorporated the comments. Could you please check whether things look okay now?


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228239274
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala
 ---
@@ -57,3 +57,34 @@ case class Max(child: Expression) extends 
DeclarativeAggregate {
 
   override lazy val evaluateExpression: AttributeReference = max
 }
+
+abstract class AnyAggBase(arg: Expression)
+  extends UnevaluableAggrgate with ImplicitCastInputTypes {
+
+  override def children: Seq[Expression] = arg :: Nil
+
+  override def dataType: DataType = BooleanType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BooleanType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+arg.dataType match {
+  case dt if dt != BooleanType =>
+TypeCheckResult.TypeCheckFailure(s"Input to function '$prettyName' 
should have been " +
+  s"${BooleanType.simpleString}, but it's 
[${arg.dataType.catalogString}].")
+  case _ => TypeCheckResult.TypeCheckSuccess
+}
+  }
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is 
true.")
--- End diff --

@mgaido91 @cloud-fan Thanks .. will add.


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228235105
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala
 ---
@@ -57,3 +57,34 @@ case class Max(child: Expression) extends 
DeclarativeAggregate {
 
   override lazy val evaluateExpression: AttributeReference = max
 }
+
+abstract class AnyAggBase(arg: Expression)
+  extends UnevaluableAggrgate with ImplicitCastInputTypes {
+
+  override def children: Seq[Expression] = arg :: Nil
+
+  override def dataType: DataType = BooleanType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BooleanType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+arg.dataType match {
+  case dt if dt != BooleanType =>
+TypeCheckResult.TypeCheckFailure(s"Input to function '$prettyName' 
should have been " +
+  s"${BooleanType.simpleString}, but it's 
[${arg.dataType.catalogString}].")
+  case _ => TypeCheckResult.TypeCheckSuccess
+}
+  }
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is 
true.")
--- End diff --

@mgaido91 Yeah... If we look at the definitions of other aggregate functions like Max, Min, etc., they don't seem to have `@Since`. However, the max and min entries in `functions.scala` are annotated with `@Since 1.3`. So basically I was not sure what the rule is when a function is not exposed in the Dataset API but only in SQL.


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228229829
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala
 ---
@@ -57,3 +57,34 @@ case class Max(child: Expression) extends 
DeclarativeAggregate {
 
   override lazy val evaluateExpression: AttributeReference = max
 }
+
+abstract class AnyAggBase(arg: Expression)
--- End diff --

@mgaido91 I was unsure where to house this class. Is it okay if I just rename it and keep it in Max.scala?


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228227624
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala
 ---
@@ -57,3 +57,34 @@ case class Max(child: Expression) extends 
DeclarativeAggregate {
 
   override lazy val evaluateExpression: AttributeReference = max
 }
+
+abstract class AnyAggBase(arg: Expression)
+  extends UnevaluableAggrgate with ImplicitCastInputTypes {
+
+  override def children: Seq[Expression] = arg :: Nil
+
+  override def dataType: DataType = BooleanType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BooleanType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+arg.dataType match {
+  case dt if dt != BooleanType =>
+TypeCheckResult.TypeCheckFailure(s"Input to function '$prettyName' 
should have been " +
+  s"${BooleanType.simpleString}, but it's 
[${arg.dataType.catalogString}].")
+  case _ => TypeCheckResult.TypeCheckSuccess
+}
+  }
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is 
true.")
--- End diff --

I will add. One observation:

I see that we specify `@since` only in the APIs; for example, for the Max aggregate we have it on the `functions.scala` max API. These aggregates are not exposed in the Dataset APIs, and none of the other aggregates seem to have it.


---




[GitHub] spark issue #22821: [SPARK-25832][SQL] remove newly added map related functi...

2018-10-25 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22821
  
retest this please


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r228020366
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -727,4 +728,67 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   "grouping expressions: [current_date(None)], value: [key: int, 
value: string], " +
 "type: GroupBy]"))
   }
+
+  def getEveryAggColumn(columnName: String): Column = {
+Column(new 
EveryAgg(Column(columnName).expr).toAggregateExpression(false))
--- End diff --

@cloud-fan Ok, let me remove these.


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r227901365
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/group-by.sql ---
@@ -80,3 +80,69 @@ SELECT 1 FROM range(10) HAVING true;
 SELECT 1 FROM range(10) HAVING MAX(id) > 0;
 
 SELECT id FROM range(10) HAVING id > 0;
+
+-- Test data
+CREATE OR REPLACE TEMPORARY VIEW test_agg AS SELECT * FROM VALUES
+  (1, true), (1, false),
+  (2, true),
+  (3, false), (3, null),
+  (4, null), (4, null),
+  (5, null), (5, true), (5, false) AS test_agg(k, v);
+
+-- empty table
+SELECT every(v), some(v), any(v) FROM test_agg WHERE 1 = 0;
+
+-- all null values
+SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 4;
+
+-- aggregates are null Filtering
+SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 5;
+
+-- group by
+SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k;
+
+-- having
+SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) = false;
+SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) IS NULL;
+
+-- basic subquery path to make sure rewrite happens in both parent and 
child plans.
+SELECT k,
+   Every(v) AS every
+FROM   test_agg
+WHERE  k = 2
+   AND v IN (SELECT Any(v)
+ FROM   test_agg
+ WHERE  k = 1)
+GROUP  BY k;
+
+-- basic subquery path to make sure rewrite happens in both parent and 
child plans.
+SELECT k,
+   Every(v) AS every
+FROM   test_agg
+WHERE  k = 2
+   AND v IN (SELECT Every(v)
+ FROM   test_agg
+ WHERE  k = 1)
+GROUP  BY k;
+
+-- input type checking Int
+SELECT every(1);
+
+-- input type checking Short
+SELECT some(1S);
+
+-- input type checking Long
+SELECT any(1L);
+
+-- input type checking String
+SELECT every("true");
+
+-- every/some/any aggregates are not supported as windows expression.
+SELECT k, v, every(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg;
--- End diff --

@cloud-fan Here are a few window tests (FYI).


---




[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22809
  
@cloud-fan Yeah.. I have some tests in group-by.sql. Please take a look.


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22809#discussion_r227881077
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
 ---
@@ -38,6 +39,18 @@ object ReplaceExpressions extends Rule[LogicalPlan] {
   }
 }
 
+/**
--- End diff --

@cloud-fan We could also add these transformations to the `ReplaceExpressions` rule and not require a dedicated rule (FYI).
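
For illustration, a minimal sketch of such a dedicated rewrite rule (the rule name is hypothetical, and it assumes `EveryAgg`/`AnyAgg` from this PR are case classes wrapping a single child expression):

```
import org.apache.spark.sql.catalyst.expressions.aggregate.{Max, Min}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Rewrite the unevaluable boolean aggregates into Min/Max so that everything
// downstream of this rule only sees aggregates it already knows how to evaluate.
object RewriteBooleanAggregates extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case EveryAgg(child) => Min(child)
    case AnyAgg(child)   => Max(child)
  }
}
```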


---




[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-23 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22809
  
cc @cloud-fan @gatorsmile 


---




[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-23 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22809

[SPARK-19851][SQL] Add support for EVERY and ANY (SOME) aggregates

## What changes were proposed in this pull request?

Implements the Every, Some, and Any aggregates in SQL. These new aggregate expressions are analyzed in the normal way and rewritten to equivalent existing aggregate expressions in the optimizer.

Every(x) => Min(x)  where x is boolean.
Some(x)  => Max(x)  where x is boolean.

Any is a synonym for Some.

SQL:
```
explain extended select every(v) from test_agg group by k;
```
Plan : 
```
== Parsed Logical Plan ==
'Aggregate ['k], [unresolvedalias('every('v), None)]
+- 'UnresolvedRelation `test_agg`

== Analyzed Logical Plan ==
every(v): boolean
Aggregate [k#0], [every(v#1) AS every(v)#5]
+- SubqueryAlias `test_agg`
   +- Project [k#0, v#1]
  +- SubqueryAlias `test_agg`
 +- LocalRelation [k#0, v#1]

== Optimized Logical Plan ==
Aggregate [k#0], [min(v#1) AS every(v)#5]
+- LocalRelation [k#0, v#1]

== Physical Plan ==
*(2) HashAggregate(keys=[k#0], functions=[min(v#1)], output=[every(v)#5])
+- Exchange hashpartitioning(k#0, 200)
   +- *(1) HashAggregate(keys=[k#0], functions=[partial_min(v#1)], 
output=[k#0, min#7])
  +- LocalTableScan [k#0, v#1]
Time taken: 0.512 seconds, Fetched 1 row(s)
```

## How was this patch tested?
Added tests in SQLQueryTestSuite and DataFrameAggregateSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-19851-specific-rewrite

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22809.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22809


commit e6c5c84ea189b1fdbac8408594c880950b6b7398
Author: Dilip Biswal 
Date:   2018-10-16T06:23:41Z

tests

commit b793d06cb937db300c78e4eb4cd143c385419e57
Author: Dilip Biswal 
Date:   2018-10-23T16:43:04Z

Code changes




---




[GitHub] spark issue #22797: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-23 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22797
  
@cloud-fan OK.. thanks a LOT.


---




[GitHub] spark issue #22797: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-23 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22797
  
@cloud-fan Sure.. Let me give that a try. Thanks


---




[GitHub] spark issue #22797: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...

2018-10-22 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22797
  
cc @cloud-fan @gatorsmile 


---




[GitHub] spark pull request #22797: [SPARK-19851][SQL] Add support for EVERY and ANY ...

2018-10-22 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22797

[SPARK-19851][SQL] Add support for EVERY and ANY (SOME) aggregates

## What changes were proposed in this pull request?
Implements the Every, Some, and Any aggregates in SQL. Logically, these aggregate expressions are mapped to Min and Max, respectively.

```
Every(x) => Min(x)  where x is boolean.
Some(x) => Max(x) where x is boolean.
```

Any is a synonym for Some.
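
As a quick sanity check of that mapping (a sketch in a spark-shell session; `every`/`any` themselves only exist once this PR is in, so the equivalent `min`/`max` forms are used here):

```
spark.sql("SELECT * FROM VALUES (1, true), (1, false), (2, true) AS t(k, v)")
  .createOrReplaceTempView("test_agg")

// Boolean MIN behaves like EVERY and boolean MAX like ANY/SOME,
// because false < true in the boolean ordering.
spark.sql("SELECT k, min(v) AS every_v, max(v) AS any_v FROM test_agg GROUP BY k").show()
// k=1 -> every_v=false, any_v=true;   k=2 -> every_v=true, any_v=true
```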

## How was this patch tested?
Added 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-19851-extend

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22797






---




[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...

2018-10-22 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22047#discussion_r227054412
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -403,6 +403,28 @@ def countDistinct(col, *cols):
 return Column(jc)
 
 
+def every(col):
--- End diff --

@cloud-fan Thank you very much for your response. I will create a new PR 
based on option-1 today and close this one.


---




[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22772
  
Thank you @gatorsmile @HyukjinKwon @xuanyuanking 


---




[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-18 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22772
  
@HyukjinKwon Sure..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-18 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22772
  
@HyukjinKwon Sure. If there are more changes planned in another PR, then it 
can take these changes in. Since this went to 2.4, i thought there may be some 
urgency to fix the links. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-18 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22772
  
cc @dongjoon-hyun @gatorsmile @xuanyuanking


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken...

2018-10-18 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22772

[SPARK-24499][SQL][DOC][Followup] Fix some broken links

## What changes were proposed in this pull request?
Fix some broken links in the new document. I have clicked through all the 
links. Hopefully i haven't missed any :-)

## How was this patch tested?
Built using jekyll and verified the links.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark doc_check

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22772.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22772


commit 2048c39e09a2f73c16e0e1ab4f0124516adf51a3
Author: Dilip Biswal 
Date:   2018-10-18T23:56:08Z

[SPARK-24499][DOC] Fix some broken links




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...

2018-10-16 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22670
  
@srowen Thank you very much.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...

2018-10-16 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22047#discussion_r225700828
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -403,6 +403,28 @@ def countDistinct(col, *cols):
 return Column(jc)
 
 
+def every(col):
--- End diff --

@gatorsmile Hi Sean, I have prepared two branches. In one, these new 
aggregate functions extend the base Max and Min classes, basically 
reusing code. In the other, we replace these aggregate expressions in the 
optimizer. Below are the links.

1. 
[branch-extend](https://github.com/dilipbiswal/spark/tree/SPARK-19851-extend)

2. 
[branch-rewrite](https://github.com/dilipbiswal/spark/tree/SPARK-19851-rewrite)

I would prefer option 1 for the following reasons.
1. The code changes are simpler.
2. It supports these aggregates as window expressions naturally. In the other 
option I have to block them.
3. It seems to me that for these simple mappings we probably don't need a 
rewrite framework. We could add one in the future if we need a more complex 
transformation.

Please let me know how we want to move forward with this. Thanks !!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...

2018-10-12 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22670
  
@srowen OK, Let me look.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...

2018-10-12 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22670
  
@srowen Thanks. Did you mean the test cases should extend a shared Spark 
context (SharedKafkaSparkContext) which would have this property set?

Actually Sean, there are 3 suites in this directory: 
`DirectKafkaStreamSuite.scala`, `KafkaDataConsumerSuite.scala`, and 
`KafkaRDDSuite.scala`. Given that only two tests are affected by this timeout 
(which we are fixing here), do you think we need to take on this refactoring 
work as part of this PR? Please let me know.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22671: [SPARK-25615][SQL][TEST] Improve the test runtime of Kaf...

2018-10-11 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22671
  
Thanks a LOT @srowen 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-10 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
Thank you very much @srowen @fjh100456 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22671: [SPARK-25615][SQL][TEST] Improve the test runtime...

2018-10-09 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22671#discussion_r223819177
  
--- Diff: 
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
 ---
@@ -332,7 +332,9 @@ class KafkaSinkSuite extends StreamTest with 
SharedSQLContext with KafkaTest {
 var ex: Exception = null
 try {
   ex = intercept[StreamingQueryException] {
-writer = createKafkaWriter(input.toDF(), withTopic = Some(topic))()
+writer = createKafkaWriter(input.toDF(),
--- End diff --

@srowen I have used a 5 second timeout for now and have moved it to 
createKafkaWriter per your suggestion. Please let me know if you are okay with it. 
Currently there is only one test that makes use of this timeout value, so shall we 
keep it conservatively at 5 seconds?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22671: [SPARK-25615][SQL][TEST] Improve the test runtime...

2018-10-09 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22671#discussion_r223771001
  
--- Diff: 
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
 ---
@@ -332,7 +332,9 @@ class KafkaSinkSuite extends StreamTest with 
SharedSQLContext with KafkaTest {
 var ex: Exception = null
 try {
   ex = intercept[StreamingQueryException] {
-writer = createKafkaWriter(input.toDF(), withTopic = Some(topic))()
+writer = createKafkaWriter(input.toDF(),
--- End diff --

@srowen Thank you. I thought about it initially, but went with a 
conservative fix.

> it seems like it's a generally good idea to not let anything block that 
long in tests
I agree.

> In fact KafkaContinuousSinkSuite sets it to 1000. We should standardize 
that too
Yeah.. I wasn't sure what value to set. If we think 1 second is a good enough 
timeout setting, I can change it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-09 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@fjh100456 Yeah. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22671: [SPARK-25615][SQL][TEST] Improve the test runtime of Kaf...

2018-10-08 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22671
  
cc @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22671: [SPARK-25615][SQL][TEST] Improve the test runtime...

2018-10-08 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22671

[SPARK-25615][SQL][TEST] Improve the test runtime of KafkaSinkSuite: 
streaming write to non-existing topic

## What changes were proposed in this pull request?
Set `kafka.max.block.ms` to 10 seconds while creating the Kafka writer. In the 
absence of this override, a default timeout of 60 seconds is used.

With this change the test completes in close to 10 seconds as opposed to 1 
minute.
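
A rough sketch of the idea (not the suite's actual code; the broker address, topic, and checkpoint path below are placeholders, and `kafka.`-prefixed options are passed through to the underlying Kafka producer):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("kafka-sink-timeout-sketch").getOrCreate()

val query = spark.readStream
  .format("rate")                                          // toy source, just to have a stream
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")     // placeholder broker
  .option("kafka.max.block.ms", "10000")                   // 10s instead of the 60s producer default
  .option("topic", "nonexistent-topic")                    // write to a topic that does not exist
  .option("checkpointLocation", "/tmp/kafka-sink-sketch")  // placeholder path
  .start()
```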

## How was this patch tested?
This is a test fix.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25615

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22671


commit 6a36250dec0e23520f482d8d092cb658ce147fb7
Author: Dilip Biswal 
Date:   2018-10-08T09:07:47Z

[SPARK-25615] Improve the test runtime of KafkaSinkSuite: streaming - write 
to non-existing topic




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...

2018-10-08 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22594
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...

2018-10-08 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22670
  
cc @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the...

2018-10-08 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22670

[SPARK-25631][SPARK-25632][SQL][TEST] Improve the test runtime of 
KafkaRDDSuite

## What changes were proposed in this pull request?
Set a reasonable poll timeout that is used while consuming topics/partitions 
from Kafka. In its absence, a default of 2 minutes is used as the timeout value, 
and all the negative tests take a minimum of 2 minutes to execute.

After this change, we save about 4 minutes in this suite.
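
A minimal sketch of the kind of change (illustrative only; it assumes the kafka-0-10 consumer poll timeout config `spark.streaming.kafka.consumer.poll.ms` carried on the suite's SparkConf):

```
import org.apache.spark.SparkConf

// Set an explicit consumer poll timeout so the negative tests fail within seconds
// instead of waiting out the ~2 minute fallback derived from spark.network.timeout.
val sparkConf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("KafkaRDDSuite-sketch")
  .set("spark.streaming.kafka.consumer.poll.ms", "2000")   // 2 second poll timeout for tests
```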

## How was this patch tested?
Test fix.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25631

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22670.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22670


commit 78bf63958e2f85490b2a23fbb249c6da6585d7a6
Author: Dilip Biswal 
Date:   2018-10-08T05:55:28Z

[SPARK-25631][SPARK-25632] Improve the test runtime of KafkaRDDSuite

commit 47e0940a0795b85d0b2c34596fa5e21850feb5cb
Author: Dilip Biswal 
Date:   2018-10-08T06:05:06Z

doc fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@srowen @gatorsmile 
Let me see if we can target a low-hanging fruit first :-). Given the same 
input set is used for table and session codecs, we can skip the cases where both 
have the same value. This way, we run the inner loop 6 times instead of 9, 
cutting the total testing combinations from 54 to 36.
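
Roughly, something like this sketch (illustrative, not the suite's actual loop):

```
val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP")
for {
  tableCodec   <- codecs
  sessionCodec <- codecs
  if tableCodec != sessionCodec   // equal pairs add no precedence information
} {
  // run the table-vs-session precedence check for (tableCodec, sessionCodec)
  println(s"checking table=$tableCodec, session=$sessionCodec")
}
```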

Does this sound reasonable ?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@srowen 
> I would vote against running tests that we think have any value randomly. 
It's just the wrong way to solve problems, as much as it would be to simply run 
90% of our test suites each time on the theory that eventually we'd catch bugs.

OK. 

> If there are, say, 3 codecs, and the point is to test whether one 
specified codec overrides another, does that really need more than 1 test? is 
there any reason to believe that override works/doesn't work differently for 
different codecs? Or are 3 tests sufficient, one to test overriding of each?

@gatorsmile Would you have some input here? Can we reduce the codec input 
set, given we are just testing precedence?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@srowen Thank you for your comments. Actually from a cursory look, i would 
agree that it does not look that pretty. I also agree that it does look like we 
are not testing as much as we used to. 

This test case was added as part of SPARK-21786, and if we look at what it 
is trying to test, it is basically verifying the precedence by which the 
compression codec is chosen:

```
val codecName = parameters
  .get("compression")
  .orElse(parquetCompressionConf)
  .getOrElse(sqlConf.parquetCompressionCodec)
  .toLowerCase(Locale.ROOT)
```
Basically, what we are trying to test is: if both table-level and session-level 
codecs are specified, which one wins? So we could have just tested this with one 
value for the table codec (say snappy) and one value for the session codec (say 
gzip). But we are being extra cautious and testing a cross product of 3 * 3 
combinations. It seems to me the 3 values we have chosen are probably the 
most commonly used ones. So I wanted to preserve this input set but decide 
which combination to test randomly. Also, as I mentioned above, we have a 
6-way loop on top, which means in one run we would probably pick 6 out of the 
9 codec combinations. And over many Jenkins runs, we will eventually test all 
the combinations we wanted to test in the first place, thereby catching any 
regressions.
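
For concreteness, a hedged sketch of the scenario being verified (assuming an existing SparkSession `spark`; the option and config names here are my shorthand for the example, not a quote of the suite):

```
// Session-level codec says gzip, table-level option says snappy; per the
// precedence above, the table-level "compression" option is expected to win.
spark.sql("SET spark.sql.parquet.compression.codec=gzip")
spark.sql("CREATE TABLE t (id INT) USING parquet OPTIONS (compression 'snappy')")
spark.sql("INSERT INTO t VALUES (1)")
// The test then reads the written parquet footers and asserts the codec is SNAPPY.
```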
 
Given the code we are trying to test, it would be extremely rare for it to 
work for one codec combination but fail for another, since the logic is 
codec-value agnostic and merely a precedence check. However, we take a hit on 
every run, both for developers who run the suite on their laptops and for all 
the Jenkins runs that happen automatically.

So we have a few options:
1) Reduce the input codec list from 3 to 2 (or 1) => the number of test 
combinations goes down from 54 to 24
2) Do what the PR is doing - pick 1 at random => the number of test combinations 
is 6 in a given run
3) Do nothing

I will do whatever you guys prefer here... Please advise.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-10-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/21732
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22500: [SPARK-25488][SQL][TEST] Refactor MiscBenchmark to use m...

2018-10-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22500
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...

2018-10-06 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22448
  
@HyukjinKwon OK.. will do.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@mgaido91 
Thanks for your input.

I took another look at the test case. Let me outline my understanding first.
 
- The test validates the precedence rules for determining the resultant 
compression to use when both session-level and table-level codecs are present.
- It verifies the correct compression is picked by reading the metadata from 
the parquet/orc files.
- The accepted configurations for parquet are: none, uncompressed, snappy, 
gzip, lzo, brotli, lz4, zstd
- The accepted configurations for orc are: none, uncompressed, snappy, 
zlib, lzo
- The test case in question uses only a SUBSET of the allowable codecs for 
parquet: uncompressed, snappy, gzip
- The test case in question uses only a SUBSET of the allowable codecs for orc: 
None, Snappy, Zlib

One thing to note is that the codecs being tested are not exhaustive; we 
pick a subset (perhaps the most popular ones). The other thing is that we have 
a 3-way loop over 1) isPartitioned, 2) convertMetastore, 3) useCTAS on top of the 
codec loop. So we will be calling the codec loop 6 times in a test, once for each 
unique combination of (isPartitioned, convertMetastore, useCTAS). And we have 
changed the codec loop to randomly pick one combination of table-level and 
session-level codecs.

Given this, I feel we are getting decent coverage, and I also feel we 
should be able to catch regressions, since one Jenkins run or another will hit 
them. If you still feel uncomfortable, should we take 2 codecs as opposed 
to 1? That would generate 24 (4 * 6) combinations as opposed to 54 (9 * 6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22648: [MINOR] Clean up the joinCriteria in SQL parser

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22648
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22047#discussion_r223171963
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -403,6 +403,28 @@ def countDistinct(col, *cols):
 return Column(jc)
 
 
+def every(col):
--- End diff --

@gatorsmile OK.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22638
  
Thanks a lot @gatorsmile @mgaido91 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22644: [SPARK-25626][SQL][TEST] Improve the test execution time...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22644
  
Thank you very much @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22644: [SPARK-25626][SQL][TEST] Improve the test execution time...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22644
  
cc @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22642: [SPARK-25613][SPARK-25614][TEST] Add tag ExtendedHiveTes...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22642
  
@dongjoon-hyun Ah.. got it.. thanks a lot.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22644: [SPARK-25626] Improve the test execution time of ...

2018-10-05 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22644

[SPARK-25626] Improve the test execution time of HiveClientSuites

## What changes were proposed in this pull request?
Improve the runtime by reducing the number of partitions created in the 
test. The number of partitions is reduced from 280 to 60.

## How was this patch tested?
Test only.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25626

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22644.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22644


commit ec991dda631d3918fd43d9b2a4f38300767778d3
Author: Dilip Biswal 
Date:   2018-10-05T16:08:11Z

[SPARK-25626] Improve the test execution time of HiveClientSuites




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22642: [SPARK-25613][SPARK-25614][TEST] Add tag ExtendedHiveTes...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22642
  
@gengliangwang I have a question. What happens when we place this tag? Is it 
skipped on regular runs?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22047
  
@cloud-fan please see my comment 
[link](https://github.com/apache/spark/pull/22047#issuecomment-411536039). I 
had tried to rewrite using max and min as suggested by Herman and Reynold in 
the original pr. I was unable to do it when the aggregate is part of the 
window. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@mgaido91 I am trying to understand the concern in the "working combination 
and a non-working one" comment. In my understanding, originally we were doing a 
cross join between two sets of codecs, so if the outer loop has 2 elements and 
the inner has 3, then we would test 6 combinations. With the current change, in a 
given run we just randomly pick one of those 6 combinations, with the hope that 
another run will pick a different one of the 6 possible combinations. In case of 
a failure in this test, we should run the full set, i.e. all 6 combinations, to 
identify the issue. Please let me know what I could be missing here.
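
To make the mechanism concrete, a minimal sketch of the random pick (illustrative names, not the suite's actual code):

```
import scala.util.Random

val tableCodecs   = Seq("UNCOMPRESSED", "SNAPPY", "GZIP")
val sessionCodecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP")

// Instead of looping over the full cross product, each run exercises a single
// randomly chosen (table codec, session codec) pair.
val tableCodec   = tableCodecs(Random.nextInt(tableCodecs.size))
val sessionCodec = sessionCodecs(Random.nextInt(sessionCodecs.size))
println(s"this run checks precedence for table=$tableCodec, session=$sessionCodec")
```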


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22641: [SPARK-25611][SPARK-25622][SQL][TESTS] Improve test run ...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22641
  
@mgaido91 Could you please take a look ? Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22641: [SPARK-25611][SPARK-25622][SQL][TESTS] Improve te...

2018-10-05 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22641

[SPARK-25611][SPARK-25622][SQL][TESTS] Improve test run time of 
CompressionCodecSuite

## What changes were proposed in this pull request?
I am using @mgaido91's idea of picking the test codecs randomly, selecting 1 in 
each loop to reduce the test combinations.

The test runtimes are below:
```
[info] - both table-level and session-level compression are set (20 seconds, 607 milliseconds)
[info] - table-level compression is not set but session-level compressions is set (14 seconds, 584 milliseconds)

[info] - both table-level and session-level compression are set (20 seconds, 140 milliseconds)
[info] - table-level compression is not set but session-level compressions is set (14 seconds, 393 milliseconds)

[info] - both table-level and session-level compression are set (21 seconds, 410 milliseconds)
[info] - table-level compression is not set but session-level compressions is set (14 seconds, 778 milliseconds)
```
## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25611

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22641.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22641


commit 2d7ffa65be68a34a1c80daea207b26ffbb008afd
Author: Dilip Biswal 
Date:   2018-10-05T09:57:46Z

[SPARK-25611][SPARK-25622] Improve test run time of CompressionCodecSuite




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


