[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/22141

I reproduced the issue with the following code (was a bit surprised by the behavior).

The tables:

```scala
scala> spark.sql("SELECT * FROM users").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
|  1|     20|
+---+-------+

scala> spark.sql("SELECT * FROM countries").show
+---+--------+
| id|    name|
+---+--------+
| 10|Portugal|
+---+--------+
```

Without the OR:

```scala
scala> spark.sql("SELECT * FROM users u WHERE u.country NOT IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  1|     20|
+---+-------+
```

With an OR and IN:

```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
+---+-------+
```

With an OR and NOT IN:

```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country NOT IN (SELECT id from countries)").show
org.apache.spark.sql.AnalysisException: Null-aware predicate sub-queries cannot be used in nested conditions: ((1 = 0) || NOT country#9 IN (list#62 []));;
```

+1 to get that fixed
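For completeness, the two tables in the reproduction can be created with something along these lines (a sketch; the setup was not part of the original comment, and the column names and values are inferred from the show() output above):

```scala
// Hypothetical setup matching the show() output in the reproduction.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("not-in-repro").getOrCreate()
import spark.implicits._

Seq((0, 10), (1, 20)).toDF("id", "country").createOrReplaceTempView("users")
Seq((10, "Portugal")).toDF("id", "name").createOrReplaceTempView("countries")
```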
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r206047653

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala ---
@@ -56,9 +57,14 @@ class FailureSafeParser[IN](
     }
   }
+  private val skipParsing = optimizeEmptySchema && schema.isEmpty
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      if (skipParsing) {
+        Iterator.single(InternalRow.empty)
--- End diff --

ohh yes my bad!
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r206045407

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -450,7 +450,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       input => rawParser.parse(input, createParser, UTF8String.fromString),
       parsedOptions.parseMode,
       schema,
-      parsedOptions.columnNameOfCorruptRecord)
+      parsedOptions.columnNameOfCorruptRecord,
+      optimizeEmptySchema = true)
--- End diff --

No, no, I'm just wondering: since you made it a parameter that can be turned on and off, what would be the case for turning it off? If there is none, shouldn't we just get rid of the parameter altogether?
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r205974316

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2233,7 +2233,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
         .option("multiline", "true")
         .options(Map("encoding" -> "UTF-16BE"))
         .json(testFile(fileName))
-        .count()
+        .collect()
--- End diff --

Just curious: why the change from count() to collect() here?
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r205974224

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -203,19 +203,11 @@ class UnivocityParser(
     }
   }
-  private val doParse = if (requiredSchema.nonEmpty) {
--- End diff --

Are the changes here https://github.com/apache/spark/pull/21909/files#diff-3a4dc120191f7052e5d98db11934bfb5R63 replacing the need for the `requiredSchema.nonEmpty` check?
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r205974275

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala ---
@@ -56,9 +57,14 @@ class FailureSafeParser[IN](
     }
   }
+  private val skipParsing = optimizeEmptySchema && schema.isEmpty
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      if (skipParsing) {
+        Iterator.single(InternalRow.empty)
--- End diff --

nit: `Iterator.empty`
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r205969639

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -450,7 +450,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       input => rawParser.parse(input, createParser, UTF8String.fromString),
       parsedOptions.parseMode,
       schema,
-      parsedOptions.columnNameOfCorruptRecord)
+      parsedOptions.columnNameOfCorruptRecord,
+      optimizeEmptySchema = true)
--- End diff --

Is multiline JSON the case for turning off `optimizeEmptySchema`?
[GitHub] spark pull request #21706: [SPARK-24702] Fix Unable to cast to calendar inte...
Github user dmateusp closed the pull request at: https://github.com/apache/spark/pull/21706
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

hey @HyukjinKwon thanks for coming back to me on this :) I'll close the PR now, and start a thread later today on the dev mailing list
[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21740#discussion_r202614320

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -165,7 +183,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
    */
   @Since("1.1.0")
   def recommendProducts(user: Int, num: Int): Array[Rating] =
--- End diff --

Thanks for the explanation Sean :)
[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21740#discussion_r202542318

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -165,7 +183,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
    */
   @Since("1.1.0")
   def recommendProducts(user: Int, num: Int): Array[Rating] =
--- End diff --

I'm new to the code base; I understand `Either` and `Option` aren't used a lot in public APIs in Spark, but shouldn't it be annotated that the function throws a certain type of exception (being more explicit about the exception), like `@throws(classOf[IllegalArgumentException])`?
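For illustration, here is what the suggested annotation looks like on a self-contained toy class (a sketch, not Spark's actual MatrixFactorizationModel code):

```scala
// Toy example of the @throws annotation being suggested; require(...) throws
// IllegalArgumentException, and the annotation surfaces that in Scaladoc and in
// the generated Java signature.
class Recommender(knownUsers: Set[Int]) {
  @throws(classOf[IllegalArgumentException])
  def recommendProducts(user: Int, num: Int): Array[Int] = {
    require(knownUsers.contains(user), s"userId $user not found in the model")
    Array.fill(num)(user) // placeholder recommendation logic
  }
}
```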
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202539843

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizerRuleExclusionSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf.OPTIMIZER_EXCLUDED_RULES
+
+
+class OptimizerRuleExclusionSuite extends PlanTest {
--- End diff --

Any test case for when a required rule is passed as a "to be excluded" rule?
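One possible shape for such a test, purely as a sketch (the optimizer construction, the withSQLConf helper, and the choice of "required" rule are all assumptions, not taken from the PR):

```scala
// Hypothetical test: excluding a rule the optimizer cannot run without should leave
// it in the resulting batches. SimpleTestOptimizer and withSQLConf are assumed helpers.
test("excluding a required rule should not remove it") {
  val requiredRule = "org.apache.spark.sql.catalyst.optimizer.ReplaceDistinctWithAggregate"
  withSQLConf(OPTIMIZER_EXCLUDED_RULES.key -> requiredRule) {
    val remainingRules = new SimpleTestOptimizer().batches.flatMap(_.rules).map(_.ruleName)
    assert(remainingRules.contains(requiredRule))
  }
}
```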
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202539342

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -175,6 +179,35 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
    * Override to provide additional rules for the operator optimization batch.
    */
   def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil
+
+  override def batches: Seq[Batch] = {
+    val excludedRules =
+      SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(!_.isEmpty))
+    val filteredOptimizationBatches = if (excludedRules.isEmpty) {
+      optimizationBatches
+    } else {
+      optimizationBatches.flatMap { batch =>
+        val filteredRules =
+          batch.rules.filter { rule =>
+            val exclude = excludedRules.contains(rule.ruleName)
+            if (exclude) {
+              logInfo(s"Optimization rule '${rule.ruleName}' is excluded from the optimizer.")
+            }
+            !exclude
+          }
+        if (batch.rules == filteredRules) {
--- End diff --

My understanding is that it is written that way to allow for logging.
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202539784

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -127,6 +127,14 @@ object SQLConf {
     }
   }
 
+  val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
+    .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
+      "specified by their rule names and separated by comma. It is not guaranteed that all the " +
+      "rules in this configuration will eventually be excluded, as some rules are necessary " +
--- End diff --

I don't understand the optimizer at a low level (I'd be one of those users for whom it is a black box). Do you think it would be feasible to enumerate the rules that cannot be excluded? Maybe even log a WARNING when validating the config parameter if it contains required rules.
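For context, usage of the proposed config would look roughly like this (a sketch; the rule names are just examples of existing optimizer rules, not a recommendation of what to exclude):

```scala
// Hedged usage sketch of spark.sql.optimizer.excludedRules.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate," +
    "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
// Queries planned after this point skip those rules; per the diff above, the optimizer
// logs an INFO line for each rule it actually excludes.
```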
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202538924

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -46,7 +47,23 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
   protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations)
 
-  def batches: Seq[Batch] = {
+  protected def postAnalysisBatches: Seq[Batch] = {
+    Batch("Eliminate Distinct", Once, EliminateDistinct) ::
+    // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
+    // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
+    // However, because we also use the analyzer to canonicalized queries (for view definition),
--- End diff --

"to canonicalized" -> "to canonicalize" ?
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21766

Just checked out the PR:

```scala
scala> spark.sql("SELECT CAST(1 as NUMERIC)")
res0: org.apache.spark.sql.DataFrame = [CAST(1 AS DECIMAL(10,0)): decimal(10,0)]

scala> spark.sql("SELECT NUMERIC(1)")
org.apache.spark.sql.AnalysisException: Undefined function: 'NUMERIC'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

I imagine some tests could be added here:

- `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DataTypeParserSuite.scala`
- `sql/core/src/test/resources/sql-tests/inputs/`

Do you think it's worth having a separate DataType, or just keeping it as an alias?
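As a sketch, the DataTypeParserSuite additions could look something like this (assuming the suite's existing checkDataType helper, and that the alias also accepts precision and scale, which depends on how the keyword is wired up):

```scala
// Hypothetical test cases for the NUMERIC alias; DecimalType.USER_DEFAULT is decimal(10,0).
checkDataType("numeric", DecimalType.USER_DEFAULT)
checkDataType("numeric(5, 2)", DecimalType(5, 2))
```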
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

hey @gatorsmile, sorry to bother, could you just clarify the above?
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

In the current Spark version I can run

```scala
scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
+----------------+
|               a|
+----------------+
|interval 1 hours|
+----------------+
```

while `spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()` throws an exception.

I am not sure what this PR changes with respect to exposing it as an external data type; it just makes the behavior consistent between raw SQL and the DataFrame API.

Is the plan to remove `CalendarIntervalType` completely?
[GitHub] spark pull request #21706: [SPARK-24702] Fix Unable to cast to calendar inte...
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21706#discussion_r200959306

--- Diff: sql/core/src/test/resources/sql-tests/inputs/cast.sql ---
@@ -42,4 +42,23 @@ SELECT CAST('9223372036854775808' AS long);
 
 DESC FUNCTION boolean;
 DESC FUNCTION EXTENDED boolean;
+
--- End diff --

@kiszk fair enough :) I've added test cases for intervals in the wrong order, additions/subtractions of intervals, repeating date keywords, embedded spaces, etc. Let me know!
[GitHub] spark issue #10130: SPARK-12105 - SPARK-SQL add convenient show functions
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/10130

+1, any chance we can revive this PR?
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

Could someone review?
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

@maropu thanks! I got help on the dev email list as well, I've added sql-tests now
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

Just added it to the FunctionRegistry:

```scala
scala> spark.sql("DESC function calendarinterval").show(truncate=false)
+--------------------------------------------------------------------------------------------------+
|function_desc                                                                                       |
+--------------------------------------------------------------------------------------------------+
|Function: calendarinterval                                                                          |
|Class: org.apache.spark.sql.catalyst.expressions.Cast                                               |
|Usage: calendarinterval(expr) - Casts the value `expr` to the target data type `calendarinterval`. |
+--------------------------------------------------------------------------------------------------+

scala> spark.sql("select calendarinterval('interval 10 days')").show()
+------------------------------------------+
|CAST(interval 10 days AS CALENDARINTERVAL)|
+------------------------------------------+
|                      interval 1 weeks ...|
+------------------------------------------+
```

I sent an email to the dev mailing list because I'm having trouble understanding how the sql-tests work (the structure, and how to run them). I'll add tests in there as soon as I figure that out :)
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

Realized I need to add `calendarinterval` as a function as well to reproduce the behavior of `int`, `date`, `string`, etc. For example:

```scala
scala> spark.sql("select string(10)").show()
+------------------+
|CAST(10 AS STRING)|
+------------------+
|                10|
+------------------+
```
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706

sure!
[GitHub] spark pull request #21706: [SPARK-24702] Fix Unable to cast to calendar inte...
GitHub user dmateusp opened a pull request: https://github.com/apache/spark/pull/21706

[SPARK-24702] Fix Unable to cast to calendar interval in spark sql

## What changes were proposed in this pull request?

Making `calendarinterval` a parse-able DataType keyword to allow for casting use-cases in SQL.

## How was this patch tested?

Added a parser test in `sql.catalyst.parser.DataTypeParserSuite`.

Before:

```
DataType calendarinterval is not supported.(line 1, pos 48)

== SQL ==
select cast(cast(interval '1' day as string) as calendarinterval)
```

After:

```scala
scala> spark.sql("select cast(cast(interval '1' day as string) as calendarinterval)")
res0: org.apache.spark.sql.DataFrame = [CAST(CAST(interval 1 days AS STRING) AS CALENDARINTERVAL): calendarinterval]
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dmateusp/spark SPARK-24702_calendar_interval

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21706.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21706

commit 26c7dd183b41c0fd7da44433afa1be3b68ac5ab1
Author: Daniel Pires
Date: 2018-07-03T14:25:03Z

    Adding CalendarInterval in sql.catalyst.parser; it is a supported data type, but wasn't parseable, hence casting to an interval was throwing a ParseException
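A plausible shape for the parser test mentioned above (assuming DataTypeParserSuite's existing checkDataType helper; the exact test added in the PR may differ):

```scala
// Hypothetical DataTypeParserSuite case for the new keyword.
checkDataType("calendarinterval", CalendarIntervalType)
```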