[GitHub] spark issue #23108: [SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERN...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/23108 retest this please
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22141 @maropu Very sorry. I haven't had the time to come back to it. I have some stuff on my plate; I will get to this after I am done. Thanks!
[GitHub] spark issue #23108: [SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERN...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/23108 retest this please
[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/23211 @wangyum Thanks. Can you please tell me how you generated this? Also, is it possible to get the runtimes of these queries to see if there are any regressions?
[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22899 @gatorsmile Thanks a lot. I completely agree that we should try and combine these two. I will continue to think about it :-)
[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/23211 retest this please
[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/23211

[SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries

Currently, predicate subqueries (IN/EXISTS) are converted to joins at the end of the optimizer, in RewritePredicateSubquery. This change moves the rewrite closer to the beginning of the optimizer. The original idea was to keep the subquery expressions in Filter form so that we can push them down as deep as possible. One disadvantage is that, after the subqueries are rewritten in join form, they are not subjected to further optimizations. In this change, we convert the subqueries to join form early in the rewrite phase and then add logic to push the left-semi and left-anti joins down like we do for normal filter ops. I can think of the following advantages:
1. We will produce consistent optimized plans for subqueries written using the SQL dialect and the DataFrame APIs.
2. It will hopefully make it easier to do the next phase of de-correlation when we open up more cases of de-correlation. In this case, it would be beneficial to expose the rewritten queries to all the other optimization rules.
3. We can now hopefully get rid of the PullupCorrelatedPredicates rule and combine it with RewritePredicateSubquery. I haven't tried it; I will take it on in a follow-up.

(P.S. Thanks to Natt for his original work [here](https://github.com/apache/spark/pull/17520). I have based this PR on his work.)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-19712-NEW
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23211.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #23211

commit f4bb126472eb5a808a3ae94bcfb59e0674e01217
Author: Dilip Biswal
Date: 2018-12-03T22:06:24Z
[SPARK-19712] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries
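A hedged illustration of the rewrite being discussed; the table names and the plan shape below are made up for illustration and assume a `spark` SparkSession in scope:

```scala
// An EXISTS predicate subquery and, roughly, the left-semi join that
// RewritePredicateSubquery turns it into. Moving this rewrite earlier
// lets later rules (e.g. filter/join pushdown) see the join form.
val df = spark.sql(
  "SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.k = t1.k)")
df.explain(true)
// Optimized plan contains, approximately:
//   Join LeftSemi, (k#10 = k#0)
//   :- Relation ... t1
//   +- Relation ... t2
```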
[GitHub] spark issue #23170: [SPARK-24423][FOLLOW-UP][SQL] Fix error example
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/23170 Looks good to me. Thanks for fixing this @wangyum. I guess I tried to trim off the example to show the mutual exclusivity of these two parameters in question.
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r232571944

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala ---
@@ -57,3 +57,34 @@ case class Max(child: Expression) extends DeclarativeAggregate {
   override lazy val evaluateExpression: AttributeReference = max
 }
+
+abstract class AnyAggBase(arg: Expression)
+  extends UnevaluableAggregate with ImplicitCastInputTypes {
+
+  override def children: Seq[Expression] = arg :: Nil
+
+  override def dataType: DataType = BooleanType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BooleanType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    arg.dataType match {
+      case dt if dt != BooleanType =>
+        TypeCheckResult.TypeCheckFailure(s"Input to function '$prettyName' should have been " +
+          s"${BooleanType.simpleString}, but it's [${arg.dataType.catalogString}].")
+      case _ => TypeCheckResult.TypeCheckSuccess
+    }
+  }
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
--- End diff --

@HyukjinKwon Sure, I will open a PR shortly.
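A hedged example of what the type check in the hunk above rejects, mirroring the `SELECT any(1L)` case in the group-by.sql tests quoted later in this digest (exact error text approximate):

```scala
// `any` requires a BooleanType input; a bigint argument fails analysis with
// the message built in checkInputDataTypes above.
spark.sql("SELECT any(1L)")
// org.apache.spark.sql.AnalysisException: ... Input to function 'any'
// should have been boolean, but it's [bigint]. ...
```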
[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22899 cc @gatorsmile
[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22899 retest this please
[GitHub] spark pull request #22788: [SPARK-25769][SQL] Escape nested columns by backti...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229797016

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
--- End diff --

@cloud-fan Makes sense. Thanks for the clarification.
[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22899 retest this please
[GitHub] spark issue #22900: [SPARK-25618][SQL][TEST] Reduce time taken to execute Ka...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22900 Oh... thank you very much @dongjoon-hyun
[GitHub] spark pull request #22900: [SPARK-25618][SQL][TEST] Reduce time taken to exe...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22900#discussion_r229625779

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala ---
@@ -221,7 +221,7 @@ class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with KafkaM
       .as[(String, String)]
     val query = startStream(kafka.map(kv => kv._2.toInt))
-    val testTime = 1.minutes
+    val testTime = 20.seconds
--- End diff --

@dongjoon-hyun Actually, I did count the number of times we went inside each case block. With this change, we did a little less than we used to before, but if I remember correctly, we did enough Kafka operations. I will get the count for each case block tomorrow for your perusal.
[GitHub] spark issue #22900: [SPARK-25618][SQL][TEST] Reduce time taken to execute Ka...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22900 cc @srowen
[GitHub] spark issue #22900: [SPARK-25618][SQL][TEST] Reduce time taken to execute Ka...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22900 retest this please
[GitHub] spark issue #22895: [SPARK-25886][SQL][Minor] Improve error message of `Fail...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22895 retest this please
[GitHub] spark issue #21860: [SPARK-24901][SQL] Merge the codegen of RegularHashMap an...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/21860 retest this please
[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22847 retest this please
[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22899 retest this please
[GitHub] spark pull request #22788: [SPARK-25769][SQL] Escape nested columns by backti...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229578738

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
[... same diff hunk as quoted above ...]
--- End diff --

@cloud-fan I had a question about the 3rd example from Huaxin; I wanted to confirm again that the output looks okay to you.

**from huaxin**
```
$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql same as name:
```
a.b
```
Is it okay to drop the backtick from the first identifier?
[GitHub] spark pull request #22788: [SPARK-25769][SQL] Escape nested columns by backti...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229577806

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
[... same diff hunk as quoted above ...]
--- End diff --

@huaxingao Were you planning to make a change to make `name` the same as `sql`? Currently they are different.
[GitHub] spark pull request #22900: [SPARK-25618][SQL][TEST] Reduce time taken to exe...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22900

[SPARK-25618][SQL][TEST] Reduce time taken to execute KafkaContinuousSourceStressForDontFailOnDataLossSuite

## What changes were proposed in this pull request?
In this test, I have reduced the test time to 20 secs from 1 minute, while reducing the sleep time from 1 sec to 100 milliseconds. With this change, I was able to run the test in 20+ seconds consistently on my laptop. I would like to see if it passes in Jenkins consistently.

## How was this patch tested?
It's a test fix.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-25618
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22900.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22900

commit 5a6907cd1dc0872d68214bc42f1b2898f9e48009
Author: Dilip Biswal
Date: 2018-10-17T05:08:19Z
[SPARK-25618] Reduce time taken to execute KafkaContinuousSourceStressForDontFailOnDataLossSuite
[GitHub] spark issue #22899: [SPARK-25573] Combine resolveExpression and resolve in t...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22899 cc @gatorsmile
[GitHub] spark pull request #22899: [SPARK-25573] Combine resolveExpression and resol...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22899

[SPARK-25573] Combine resolveExpression and resolve in the Analyzer

## What changes were proposed in this pull request?
Currently in the Analyzer, we have two methods, 1) resolve and 2) resolveExpression, that are called from different code paths to resolve attributes, column ordinals and extract-value expressions. In this PR, we combine the two into one method to make sure there is only one method tasked with resolving the attributes.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-25573-final
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22899.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22899

commit ff375690c7a884c52fa4683116570a964e46a96d
Author: Dilip Biswal
Date: 2018-10-31T05:36:30Z
[SPARK-25573] Combine resolveExpression and resolve in the Analyzer
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/17520 @gatorsmile Sure, Sean.
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229362181

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
     // "Extract PythonUDF From JoinCondition".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile Sure, Sean. Let me give it a try.
[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22778#discussion_r229181821

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@gatorsmile Do you think it's a good time to revisit Natt's PR to convert subquery expressions to joins early in the optimization process? Perhaps then we can take advantage of all the subsequent rules firing after the subquery rewrite?
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r229063489

--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
 * View
   * If column aliases are not specified in view definition queries, both Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be able to read views created
-by Hive, users should explicitly specify column aliases in view definition queries.
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

@dongjoon-hyun Oh, thanks. Because it requires an explicit correlation? Sorry, I don't have a 1.2.2 env to try it out.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r229062732

--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

@dongjoon-hyun I was thinking that calling upper on an int column is probably not very intuitive :-) What do you think about adding a string literal to the projection?
```
SELECT c + 1, upper(d) FROM (SELECT 1 c, 'test' AS d)
```
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r229051205

--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
[... same diff hunk as quoted above ...]
--- End diff --

nit: We could perhaps simplify the query to:
```
CREATE VIEW v1 AS (SELECT c1 + 1, upper(c2) FROM t1);
```
What do you think?
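A hedged sketch of the fix the migration guide is recommending; the alias names below are made up, and a `spark` SparkSession is assumed:

```scala
// With explicit aliases on every output column, neither Spark nor Hive needs
// to generate alias names, so a Hive-created view stays readable from Spark.
spark.sql("CREATE TABLE t1 (c1 INT, c2 STRING)")
spark.sql(
  "CREATE VIEW v1 AS SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1")
```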
[GitHub] spark issue #22856: [SPARK-25856][SQL][MINOR] Remove AverageLike and CountLi...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22856 Thank you @srowen @dongjoon-hyun @mgaido91 @HyukjinKwon
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r228766091

--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries.
--- End diff --

@seancxmao Thanks for adding the doc. Could a small example here help illustrate this better?
[GitHub] spark pull request #22797: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal closed the pull request at: https://github.com/apache/spark/pull/22797
[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22809 Thanks a lot @cloud-fan @mgaido91
[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...
Github user dilipbiswal closed the pull request at: https://github.com/apache/spark/pull/22047
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228725645

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala ---
@@ -0,0 +1,62 @@
+/* ... Apache License header ... */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+abstract class UnevaluableBooleanAggBase(arg: Expression)
--- End diff --

@cloud-fan @mgaido91 Thank you. I have added a TODO for now.
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228666789

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 I re-read your comments again, and here is what I think you are implying:
1. In UnevaluableAggregate, define a method, say `replacedExpression`.
2. The sub-class overrides it and returns an actual rewritten expression.
3. In the optimizer, match UnevaluableAggregate and call `replacedExpression`.

Did I understand it correctly? If so, I actually wouldn't prefer that way. The reason is that the replaced expression is hidden from the analyzer, so it may not be safe. The way the ReplaceExpression framework is nicely designed, the analyzer resolves the rewritten expression normally (as it's the child expression). That's the reason I have opted for specific/targeted rewrites. If Wenchen and you think otherwise, then I can change it.
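A rough sketch of the alternative described in the numbered steps above; the `replacedExpression` method and the rule body are hypothetical, not actual Spark API:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate

// Hypothetical: each unevaluable aggregate carries its own rewrite, and an
// optimizer rule substitutes it wholesale. The concern raised above is that
// the replacement expression is never seen (and so never checked) by the
// analyzer.
trait UnevaluableAggregate extends DeclarativeAggregate {
  def replacedExpression: Expression // hypothetical method
}

// In an optimizer rule (sketch):
//   plan transformAllExpressions {
//     case u: UnevaluableAggregate => u.replacedExpression
//   }
```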
[GitHub] spark pull request #22856: [SPARK-25856][SQL][MINOR] Remove AverageLike and ...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22856

[SPARK-25856][SQL][MINOR] Remove AverageLike and CountLike classes

## What changes were proposed in this pull request?
These two classes were added for regr_ expression support (SPARK-23907). Those expressions have since been removed, and hence we can remove these base classes and inline the logic in the concrete classes.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark average_cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22856.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22856

commit 6becdd55e173bca40943e2aafda6a49cf5eb9934
Author: Dilip Biswal
Date: 2018-10-26T20:28:20Z
[SPARK-25856] Remove AverageLike and CountLike classes
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228644504

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 I tried it [here](https://github.com/apache/spark/compare/master...dilipbiswal:SPARK-19851-rewrite?expand=1). I had trouble getting it to work for window expressions. That's why @cloud-fan suggested trying the current approach in this [comment](https://github.com/apache/spark/pull/22797#issuecomment-432106729).
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228638310

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 Actually, I tried a few different ideas. They are in the comments of the original PR ([link](https://github.com/apache/spark/pull/22047)). I had prepared two branches with the two approaches ([comment](https://github.com/apache/spark/pull/22047#discussion_r225700828)). Could you please take a look?
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228636110

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -282,6 +283,31 @@ trait RuntimeReplaceable extends UnaryExpression with Unevaluable {
   override lazy val canonicalized: Expression = child.canonicalized
 }
+
+/**
+ * An aggregate expression that gets rewritten (currently by the optimizer) into a
+ * different aggregate expression for evaluation. This is mainly used to provide compatibility
+ * with other databases. For example, we use this to support every, any/some aggregates by rewriting
+ * them with Min and Max respectively.
+ */
+trait UnevaluableAggregate extends DeclarativeAggregate {
+
+  override def nullable: Boolean = true
--- End diff --

@mgaido91 I think for aggregates it's different. Please see `Max` and `Min`; they both define it to be nullable. They work on groups of rows and can return null on empty input.
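A hedged illustration of the nullability point above, using inline data (output shape approximate):

```scala
// Over an empty group, Min/Max (and hence a rewritten every/any) yield NULL,
// which is why these aggregates are declared nullable.
spark.sql(
  "SELECT min(v), max(v) FROM VALUES (true), (false) AS t(v) WHERE 1 = 0").show()
// +------+------+
// |min(v)|max(v)|
// +------+------+
// |  null|  null|
// +------+------+
```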
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228634827

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 Most of the aggregates are nullable, no? Did you have a suggestion here?
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228634159

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/UnevaluableAggs.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 RuntimeReplaceable works for scalar expressions but not for aggregate expressions.
[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22809 @cloud-fan @mgaido91 I have incorporated the comments. Could you please check if things look okay now?
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228239274

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 @cloud-fan Thanks, will add.
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228235105

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

@mgaido91 Yeah. If we look at the definitions of other aggregate functions like Max, Min, etc., they don't seem to have `@Since`. However, max and min are defined in `functions.scala` with `@Since 1.3`. So basically, I was not sure what the rule is when a function is not exposed in the Dataset API but only from SQL.
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228229829

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala ---
[... same diff context as quoted above, ending at `abstract class AnyAggBase(arg: Expression)` ...]
--- End diff --

@mgaido91 I was confused about where to house this class. Is it okay if I just rename it and keep it in Max.scala?
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228227624

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Max.scala ---
[... same diff hunk as quoted above ...]
--- End diff --

I will add it. One observation: I see that we specify `@since` only in the APIs; for example, for the aggregate Max, we have it in functions.scala. These aggregates are not exposed in the Dataset APIs, and none of the other aggregates seem to have it.
[GitHub] spark issue #22821: [SPARK-25832][SQL] remove newly added map related functi...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22821 retest this please
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r228020366

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -727,4 +728,67 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       "grouping expressions: [current_date(None)], value: [key: int, value: string], " +
       "type: GroupBy]"))
   }
+
+  def getEveryAggColumn(columnName: String): Column = {
+    Column(new EveryAgg(Column(columnName).expr).toAggregateExpression(false))
--- End diff --

@cloud-fan Ok, let me remove these.
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r227901365

--- Diff: sql/core/src/test/resources/sql-tests/inputs/group-by.sql ---
@@ -80,3 +80,69 @@ SELECT 1 FROM range(10) HAVING true;
 SELECT 1 FROM range(10) HAVING MAX(id) > 0;
 SELECT id FROM range(10) HAVING id > 0;
+
+-- Test data
+CREATE OR REPLACE TEMPORARY VIEW test_agg AS SELECT * FROM VALUES
+  (1, true), (1, false),
+  (2, true),
+  (3, false), (3, null),
+  (4, null), (4, null),
+  (5, null), (5, true), (5, false) AS test_agg(k, v);
+
+-- empty table
+SELECT every(v), some(v), any(v) FROM test_agg WHERE 1 = 0;
+
+-- all null values
+SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 4;
+
+-- aggregates are null Filtering
+SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 5;
+
+-- group by
+SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k;
+
+-- having
+SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) = false;
+SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) IS NULL;
+
+-- basic subquery path to make sure rewrite happens in both parent and child plans.
+SELECT k, Every(v) AS every FROM test_agg
+WHERE k = 2 AND v IN (SELECT Any(v) FROM test_agg WHERE k = 1)
+GROUP BY k;
+
+-- basic subquery path to make sure rewrite happens in both parent and child plans.
+SELECT k, Every(v) AS every FROM test_agg
+WHERE k = 2 AND v IN (SELECT Every(v) FROM test_agg WHERE k = 1)
+GROUP BY k;
+
+-- input type checking Int
+SELECT every(1);
+
+-- input type checking Short
+SELECT some(1S);
+
+-- input type checking Long
+SELECT any(1L);
+
+-- input type checking String
+SELECT every("true");
+
+-- every/some/any aggregates are not supported as window expressions.
+SELECT k, v, every(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg;
--- End diff --

@cloud-fan Here are a few window tests (fyi).
[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22809 @cloud-fan Yeah, I have some tests in group-by.sql. Please take a look.
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22809#discussion_r227881077

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala ---
@@ -38,6 +39,18 @@ object ReplaceExpressions extends Rule[LogicalPlan] {
   }
 }
+
+/**
--- End diff --

@cloud-fan We could also add these transformations to the `ReplaceExpressions` rule and not require a dedicated rule (fyi).
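A hedged sketch of that alternative; the `EveryAgg`/`AnyAgg` names follow this PR, and the extra match arms are illustrative rather than the merged code:

```scala
import org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable
import org.apache.spark.sql.catalyst.expressions.aggregate.{AnyAgg, EveryAgg, Max, Min}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: fold the boolean-aggregate rewrites into ReplaceExpressions
// instead of adding a dedicated optimizer rule.
object ReplaceExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case e: RuntimeReplaceable => e.child
    case EveryAgg(arg) => Min(arg) // every(x) behaves as min(x) for booleans
    case AnyAgg(arg)   => Max(arg) // any/some(x) behaves as max(x) for booleans
  }
}
```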
[GitHub] spark issue #22809: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22809 cc @cloud-fan @gatorsmile
[GitHub] spark pull request #22809: [SPARK-19851][SQL] Add support for EVERY and ANY ...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22809

[SPARK-19851][SQL] Add support for EVERY and ANY (SOME) aggregates

## What changes were proposed in this pull request?
Implements the Every, Some and Any aggregates in SQL. These new aggregate expressions are analyzed in the normal way and rewritten to equivalent existing aggregate expressions in the optimizer.

Every(x) => Min(x) where x is boolean.
Some(x) => Max(x) where x is boolean.
Any is a synonym for Some.

SQL
```
explain extended select every(v) from test_agg group by k;
```
Plan:
```
== Parsed Logical Plan ==
'Aggregate ['k], [unresolvedalias('every('v), None)]
+- 'UnresolvedRelation `test_agg`

== Analyzed Logical Plan ==
every(v): boolean
Aggregate [k#0], [every(v#1) AS every(v)#5]
+- SubqueryAlias `test_agg`
   +- Project [k#0, v#1]
      +- SubqueryAlias `test_agg`
         +- LocalRelation [k#0, v#1]

== Optimized Logical Plan ==
Aggregate [k#0], [min(v#1) AS every(v)#5]
+- LocalRelation [k#0, v#1]

== Physical Plan ==
*(2) HashAggregate(keys=[k#0], functions=[min(v#1)], output=[every(v)#5])
+- Exchange hashpartitioning(k#0, 200)
   +- *(1) HashAggregate(keys=[k#0], functions=[partial_min(v#1)], output=[k#0, min#7])
      +- LocalTableScan [k#0, v#1]

Time taken: 0.512 seconds, Fetched 1 row(s)
```

## How was this patch tested?
Added tests in SQLQueryTestSuite, DataFrameAggregateSuite.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-19851-specific-rewrite
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22809.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22809

commit e6c5c84ea189b1fdbac8408594c880950b6b7398
Author: Dilip Biswal
Date: 2018-10-16T06:23:41Z
tests

commit b793d06cb937db300c78e4eb4cd143c385419e57
Author: Dilip Biswal
Date: 2018-10-23T16:43:04Z
Code changes
[GitHub] spark issue #22797: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22797 @cloud-fan OK, thanks a LOT.
[GitHub] spark issue #22797: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22797 @cloud-fan Sure, let me give that a try. Thanks
[GitHub] spark issue #22797: [SPARK-19851][SQL] Add support for EVERY and ANY (SOME) ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22797 cc @cloud-fan @gatorsmile
[GitHub] spark pull request #22797: [SPARK-19851][SQL] Add support for EVERY and ANY ...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22797

[SPARK-19851][SQL] Add support for EVERY and ANY (SOME) aggregates

## What changes were proposed in this pull request?
Implements the Every, Some and Any aggregates in SQL. Logically, these aggregate expressions are mapped to Min and Max, respectively.
```
Every(x) => Min(x) where x is boolean.
Some(x) => Max(x) where x is boolean.
```
Any is a synonym for Some.

## How was this patch tested?
Added

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-19851-extend
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22797.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22797
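A hedged illustration of the mapping described above, using the `test_agg` view from the group-by.sql tests quoted earlier in this digest:

```scala
// For BooleanType, false < true, so min(v) is true only when every value in
// the group is true, and max(v) is true when at least one value is true.
spark.sql("SELECT k, every(v), any(v) FROM test_agg GROUP BY k")
// is rewritten to behave like:
spark.sql("SELECT k, min(v), max(v) FROM test_agg GROUP BY k")
```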
[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22047#discussion_r227054412

--- Diff: python/pyspark/sql/functions.py ---
@@ -403,6 +403,28 @@ def countDistinct(col, *cols):
     return Column(jc)
+
+def every(col):
--- End diff --

@cloud-fan Thank you very much for your response. I will create a new PR based on option 1 today and close this one.
[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22772 Thank you @gatorsmile @HyukjinKwon @xuanyuanking
[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22772 @HyukjinKwon Sure.
[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22772 @HyukjinKwon Sure. If there are more changes planned in another PR, then it can take these changes in. Since this went to 2.4, I thought there might be some urgency to fix the links.
[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22772 cc @dongjoon-hyun @gatorsmile @xuanyuanking
[GitHub] spark pull request #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22772

[SPARK-24499][SQL][DOC][Followup] Fix some broken links

## What changes were proposed in this pull request?
Fix some broken links in the new document. I have clicked through all the links. Hopefully I haven't missed any :-)

## How was this patch tested?
Built using jekyll and verified the links.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark doc_check
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22772.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22772

commit 2048c39e09a2f73c16e0e1ab4f0124516adf51a3
Author: Dilip Biswal
Date: 2018-10-18T23:56:08Z
[SPARK-24499][DOC] Fix some broken links
[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22670 @srowen Thank you very much.
[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22047#discussion_r225700828

--- Diff: python/pyspark/sql/functions.py ---
[... same diff hunk as quoted above ...]
--- End diff --

@gatorsmile Hi Sean, I have prepared two branches. In one, these new aggregate functions extend the base Max and Min classes, basically reusing code. In the other, we replace these aggregate expressions in the optimizer. Below are the links.
1. [branch-extend](https://github.com/dilipbiswal/spark/tree/SPARK-19851-extend)
2. [branch-rewrite](https://github.com/dilipbiswal/spark/tree/SPARK-19851-rewrite)

I would prefer option 1 for the following reasons:
1. The code changes are simpler.
2. It supports these aggregates as window expressions naturally. In the other option, I have to block that.
3. It seems to me that for these simple mappings we probably don't need a rewrite framework. We could add one in the future if we need a more complex transformation.

Please let me know how we want to move forward with this. Thanks!
[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22670 @srowen OK, let me look.
[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22670 @srowen Thanks. Did you mean the test cases should extend a shared Spark context (SharedKafkaSparkContext) which would have this property set? Actually Sean, there are 3 suites in this directory: `DirectKafkaStreamSuite.scala`, `KafkaDataConsumerSuite.scala`, and `KafkaRDDSuite.scala`. Given that only two tests are affected by this timeout (which we are fixing here), do you think we need to take on this refactoring as part of this PR? Please let me know.
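A rough sketch of what that shared-context refactoring could look like. SharedKafkaSparkContext is not an existing trait; its name, shape, and the timeout value below are all assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical shared base: all three Kafka suites would mix this in and
// inherit a single SparkContext with the reduced poll timeout already set.
trait SharedKafkaSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName(this.getClass.getSimpleName)
      .set("spark.streaming.kafka.consumer.poll.ms", "10000") // assumed value
    sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()
    } finally {
      super.afterAll()
    }
  }
}
```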
[GitHub] spark issue #22671: [SPARK-25615][SQL][TEST] Improve the test runtime of Kaf...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22671 Thanks a LOT @srowen
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 Thank you very much @srowen @fjh100456
[GitHub] spark pull request #22671: [SPARK-25615][SQL][TEST] Improve the test runtime...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22671#discussion_r223819177

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala ---
```
@@ -332,7 +332,9 @@ class KafkaSinkSuite extends StreamTest with SharedSQLContext with KafkaTest {
     var ex: Exception = null
     try {
       ex = intercept[StreamingQueryException] {
-        writer = createKafkaWriter(input.toDF(), withTopic = Some(topic))()
+        writer = createKafkaWriter(input.toDF(),
```
--- End diff --

@srowen I have used a 5 second timeout for now and have moved it into createKafkaWriter per your suggestion. Please let me know if you are okay with this. Currently only one test makes use of this timeout value, so shall we keep it conservatively at 5 seconds?
[GitHub] spark pull request #22671: [SPARK-25615][SQL][TEST] Improve the test runtime...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22671#discussion_r223771001

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala ---
```
@@ -332,7 +332,9 @@ class KafkaSinkSuite extends StreamTest with SharedSQLContext with KafkaTest {
     var ex: Exception = null
     try {
       ex = intercept[StreamingQueryException] {
-        writer = createKafkaWriter(input.toDF(), withTopic = Some(topic))()
+        writer = createKafkaWriter(input.toDF(),
```
--- End diff --

@srowen Thank you. I thought about it initially, but went with a conservative fix.
> it seems like it's a generally good idea to not let anything block that long in tests

I agree.
> In fact KafkaContinuousSinkSuite sets it to 1000. We should standardize that too

Yeah, I wasn't sure what value to set. If we think 1 second is a good enough timeout setting, I can change it.
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @fjh100456 Yeah.
[GitHub] spark issue #22671: [SPARK-25615][SQL][TEST] Improve the test runtime of Kaf...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22671 cc @gatorsmile
[GitHub] spark pull request #22671: [SPARK-25615][SQL][TEST] Improve the test runtime...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22671 [SPARK-25615][SQL][TEST] Improve the test runtime of KafkaSinkSuite: streaming write to non-existing topic

## What changes were proposed in this pull request?
Set `kafka.max.block.ms` to 10 seconds while creating the Kafka writer. Without this override, the producer uses its default timeout of 60 seconds. With this change the test completes in close to 10 seconds as opposed to 1 minute.

## How was this patch tested?
This is a test fix.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-25615
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22671.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22671

commit 6a36250dec0e23520f482d8d092cb658ce147fb7
Author: Dilip Biswal
Date: 2018-10-08T09:07:47Z
[SPARK-25615] Improve the test runtime of KafkaSinkSuite: streaming - write to non-existing topic
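A minimal sketch of the override described above. The helper name, broker address, and the use of a streaming DataFrame are assumptions, not the suite's actual code; the point is only where `kafka.max.block.ms` is set.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.streaming.DataStreamWriter

// Sketch: configure the Kafka sink so the producer blocks at most 10s
// (kafka.max.block.ms) instead of its 60s default when the target topic
// does not exist, letting the negative test fail fast.
def kafkaWriterSketch(df: DataFrame, topic: String): DataStreamWriter[Row] =
  df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // assumed test broker
    .option("kafka.max.block.ms", "10000")               // 10s instead of the 60s default
    .option("topic", topic)
```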
[GitHub] spark issue #22594: [SPARK-25674][SQL] If the records are incremented by mor...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22594 retest this please
[GitHub] spark issue #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test r...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22670 cc @gatorsmile
[GitHub] spark pull request #22670: [SPARK-25631][SPARK-25632][SQL][TEST] Improve the...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22670 [SPARK-25631][SPARK-25632][SQL][TEST] Improve the test runtime of KafkaRDDSuite

## What changes were proposed in this pull request?
Set a reasonable poll timeout that is used while consuming topics/partitions from Kafka. In the absence of it, a default of 2 minutes is used as the timeout, so every negative test takes a minimum of 2 minutes to execute. After this change, we save about 4 minutes in this suite.

## How was this patch tested?
Test fix.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-25631
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22670.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22670

commit 78bf63958e2f85490b2a23fbb249c6da6585d7a6
Author: Dilip Biswal
Date: 2018-10-08T05:55:28Z
[SPARK-25631][SPARK-25632] Improve the test runtime of KafkaRDDSuite

commit 47e0940a0795b85d0b2c34596fa5e21850feb5cb
Author: Dilip Biswal
Date: 2018-10-08T06:05:06Z
doc fix
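A sketch of the kind of override described, assuming the timeout is set on the suite's SparkConf; the 10-second value is an assumption for illustration.

```scala
import org.apache.spark.SparkConf

// Sketch: cap the Kafka consumer poll timeout for the suite so negative
// tests fail fast instead of waiting out the roughly two-minute default
// that applies when this config is absent.
val sparkConf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("KafkaRDDSuite")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000") // assumed 10s value
```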
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @srowen @gatorsmile Let me see if we can target the low-hanging fruit first :-). Given that the same input set is used for the table and session codecs, we can skip the combinations where both have the same value. This way, we run the inner loop 6 times as opposed to 9, cutting the total number of tested combinations from 54 to 36. Does this sound reasonable?
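A sketch of what that pruning could look like; the codec list and loop shape are assumptions for illustration, not the suite's exact code.

```scala
// Precedence is only observable when the two levels disagree, so skip the
// pairs where the table-level and session-level codec are identical:
// 3 x 3 = 9 pairs shrink to 6.
val codecs = Seq("SNAPPY", "GZIP", "UNCOMPRESSED") // assumed input set
val pairs = for {
  tableCodec   <- codecs
  sessionCodec <- codecs
  if tableCodec != sessionCodec
} yield (tableCodec, sessionCodec)

pairs.foreach { case (table, session) =>
  // run the precedence check with table codec `table` and session codec `session`
  println(s"table=$table session=$session")
}
```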
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @srowen
> I would vote against running tests that we think have any value randomly. It's just the wrong way to solve problems, as much as it would be to simply run 90% of our test suites each time on the theory that eventually we'd catch bugs.

OK.
> If there are, say, 3 codecs, and the point is to test whether one specified codec overrides another, does that really need more than 1 test? is there any reason to believe that override works/doesn't work differently for different codecs? Or are 3 tests sufficient, one to test overriding of each?

@gatorsmile Would you have some input here? Can we reduce the codec input set, given that we are just testing precedence?
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @srowen Thank you for your comments. Actually, from a cursory look, I would agree that it does not look that pretty. I also agree that it does look like we are not testing as much as we used to. This test case was added as part of SPARK-21786, and if we look at what it is trying to test, it is basically the precedence by which the compression codec is chosen:
```
val codecName = parameters
  .get("compression")
  .orElse(parquetCompressionConf)
  .getOrElse(sqlConf.parquetCompressionCodec)
  .toLowerCase(Locale.ROOT)
```
Basically, what we are trying to test is: if both table-level and session-level codecs are specified, which one wins? So we could have tested this with one value for the table codec (say snappy) and one value for the session codec (say gzip). But we are being extra cautious and testing the cross product of 3 * 3 combinations. It seems to me the 3 values we have chosen are probably the most commonly used ones, so I wanted to preserve this input set but decide randomly which combination to test. Also, as I mentioned above, we have a 6-way loop on top, which means in one run we would probably pick 6 out of the 9 codec combinations. And over many Jenkins runs, we will eventually test all the combinations we wanted to test in the first place, thereby catching any regression that occurs. Given the code under test, it would be extremely rare for it to work for one codec combination but fail for another, as the logic is agnostic to the codec value and is merely a precedence check. However, we are taking a hit on every run, both for developers running on their laptops and for all the Jenkins runs that happen automatically. So we have a few options:
1) Reduce the input codec list from 3 to 2 (or 1) => the number of test combinations goes down from 54 to 24
2) Do what the PR is doing - pick 1 at random => 6 test combinations in a given run
3) Do nothing

I will do whatever you guys prefer here. Please advise.
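For concreteness, the precedence being exercised looks roughly like this from a user's point of view. This is an illustrative sketch, not the suite's code; the object name, data, and output path are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object CodecPrecedenceSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("codec-precedence").getOrCreate()
  import spark.implicits._

  // Session-level default codec:
  spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

  // The table-level "compression" option wins over the session conf,
  // so this write is snappy-compressed.
  Seq(1, 2, 3).toDF("id")
    .write.mode("overwrite")
    .option("compression", "snappy")
    .parquet("/tmp/codec_precedence_check") // illustrative path

  spark.stop()
}
```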
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/21732 retest this please
[GitHub] spark issue #22500: [SPARK-25488][SQL][TEST] Refactor MiscBenchmark to use m...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22500 retest this please
[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22448 @HyukjinKwon OK.. will do.
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @mgaido91 Thanks for your input. I took another look at the test case. Let me outline some of my understandings first.
- The test validates the precedence rules that determine the resultant compression in the presence of session-level and table-level codecs.
- It verifies the correct compression is picked by reading the metadata information from the parquet/orc file metadata.
- The accepted configurations for parquet are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd
- The accepted configurations for orc are: none, uncompressed, snappy, zlib, lzo
- The test case in question uses only a SUBSET of the allowable codecs for parquet: uncompressed, snappy, gzip
- The test case in question uses only a SUBSET of the allowable codecs for orc: None, Snappy, Zlib

One thing to note is that the codecs being tested are not exhaustive; we pick a subset (perhaps the most popular ones). The other thing is that we have a 3-way loop - 1) isPartitioned, 2) convertMetastore, 3) useCTAS - on top of the codec loop. So we will be calling the codec loop 6 times in a test, once for each unique combination of (isPartitioned, convertMetastore, useCTAS). And we have changed the codec loop to randomly pick one combination of table-level and session-level codecs. Given this, I feel we are getting decent coverage, and I also feel we should be able to catch a regression in one Jenkins run or another. If you still feel uncomfortable, should we take 2 codecs as opposed to 1? It would generate 24 (4 * 6) combinations as opposed to 54 (9 * 6).
[GitHub] spark issue #22648: [MINOR] Clean up the joinCriteria in SQL parser
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22648 LGTM
[GitHub] spark pull request #22047: [SPARK-19851] Add support for EVERY and ANY (SOME...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22047#discussion_r223171963

--- Diff: python/pyspark/sql/functions.py ---
```
@@ -403,6 +403,28 @@ def countDistinct(col, *cols):
     return Column(jc)

+def every(col):
```
--- End diff --

@gatorsmile OK.
[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22638 Thanks a lot @gatorsmile @mgaido91
[GitHub] spark issue #22644: [SPARK-25626][SQL][TEST] Improve the test execution time...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22644 Thank you very much @gatorsmile
[GitHub] spark issue #22644: [SPARK-25626][SQL][TEST] Improve the test execution time...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22644 cc @gatorsmile
[GitHub] spark issue #22642: [SPARK-25613][SPARK-25614][TEST] Add tag ExtendedHiveTes...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22642 @dongjoon-hyun Ah.. got it.. thanks a lot.
[GitHub] spark pull request #22644: [SPARK-25626] Improve the test execution time of ...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22644 [SPARK-25626] Improve the test execution time of HiveClientSuites

## What changes were proposed in this pull request?
Improve the runtime by reducing the number of partitions created in the test. The number of partitions is reduced from 280 to 60.

## How was this patch tested?
Test only.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-25626
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22644.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22644

commit ec991dda631d3918fd43d9b2a4f38300767778d3
Author: Dilip Biswal
Date: 2018-10-05T16:08:11Z
[SPARK-25626] Improve the test execution time of HiveClientSuites
[GitHub] spark issue #22642: [SPARK-25613][SPARK-25614][TEST] Add tag ExtendedHiveTes...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22642 @gengliangwang I have a question. What happens when we place this tag? Is it skipped on regular runs?
[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22047 @cloud-fan Please see my comment [link](https://github.com/apache/spark/pull/22047#issuecomment-411536039). I had tried to rewrite using max and min as suggested by Herman and Reynold in the original PR. I was unable to do it when the aggregate is part of a window expression.
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25612][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @mgaido91 Trying to understand the concern in the "working combination and a non-working" comment. In my understanding, originally we were doing a cross join between two sets of codecs, so if the outer loop has 2 elements and the inner has 3, we would test 6 combinations. With the current change, in a given run we randomly pick one of those 6 combinations, with the hope that another run would pick a different one. In case of a failure in this test, we should run the full tests, i.e. all 6 combinations, to identify the issue. Please let me know what I could be missing here.
[GitHub] spark issue #22641: [SPARK-25611][SPARK-25622][SQL][TESTS] Improve test run ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22641 @mgaido91 Could you please take a look? Thank you.
[GitHub] spark pull request #22641: [SPARK-25611][SPARK-25622][SQL][TESTS] Improve te...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22641 [SPARK-25611][SPARK-25622][SQL][TESTS] Improve test run time of CompressionCodecSuite

## What changes were proposed in this pull request?
I am using @mgaido91's idea of picking the test codecs randomly, selecting one pair in each loop, to reduce the number of test combinations. The test runtimes are below:
```
[info] - both table-level and session-level compression are set (20 seconds, 607 milliseconds)
[info] - table-level compression is not set but session-level compressions is set (14 seconds, 584 milliseconds)
[info] - both table-level and session-level compression are set (20 seconds, 140 milliseconds)
[info] - table-level compression is not set but session-level compressions is set (14 seconds, 393 milliseconds)
[info] - both table-level and session-level compression are set (21 seconds, 410 milliseconds)
[info] - table-level compression is not set but session-level compressions is set (14 seconds, 778 milliseconds)
```

## How was this patch tested?
Test only.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-25611
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22641.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22641

commit 2d7ffa65be68a34a1c80daea207b26ffbb008afd
Author: Dilip Biswal
Date: 2018-10-05T09:57:46Z
[SPARK-25611][SPARK-25622] Improve test run time of CompressionCodecSuite
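A sketch of the random selection described above; the codec list, object name, and structure are illustrative assumptions, not the suite's exact code.

```scala
import scala.util.Random

object RandomCodecPick extends App {
  // Instead of iterating the full 3 x 3 cross product on every run, draw one
  // (table, session) codec pair per loop iteration; successive CI runs then
  // cover different pairs over time.
  val codecs = Seq("SNAPPY", "GZIP", "UNCOMPRESSED") // assumed input set
  val tableCodec = codecs(Random.nextInt(codecs.length))
  val sessionCodec = codecs(Random.nextInt(codecs.length))
  println(s"testing tableCodec=$tableCodec sessionCodec=$sessionCodec")
  // ... run the precedence assertions with the sampled pair ...
}
```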