[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/22141
  
I reproduced the issue with the following code (I was a bit surprised by the behavior).

The tables:
```scala
scala> spark.sql("SELECT * FROM users").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
|  1|     20|
+---+-------+


scala> spark.sql("SELECT * FROM countries").show
+---+--------+
| id|    name|
+---+--------+
| 10|Portugal|
+---+--------+
```
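
(For reference, roughly how tables like these could be set up in the shell; the values are just read off the output above:)

```scala
// Hypothetical setup matching the output shown above.
// In spark-shell this import is already in scope.
import spark.implicits._

Seq((0, 10), (1, 20)).toDF("id", "country").createOrReplaceTempView("users")
Seq((10, "Portugal")).toDF("id", "name").createOrReplaceTempView("countries")
```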

Without the OR:
```scala
scala> spark.sql("SELECT * FROM users u WHERE u.country NOT IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  1|     20|
+---+-------+
```

With an OR and IN:
```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
+---+-------+
```

With an OR and NOT IN:
```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country NOT IN (SELECT id from countries)").show
org.apache.spark.sql.AnalysisException: Null-aware predicate sub-queries cannot be used in nested conditions: ((1 = 0) || NOT country#9 IN (list#62 []));;
```

+1 to get that fixed
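
In the meantime, a possible workaround is to split the disjunction so the NOT IN stays a top-level predicate. A sketch (not a fully general rewrite, since UNION also deduplicates rows):

```scala
// Sketch of a workaround: rewrite `p1 OR p2` as the UNION of the two filters,
// so the null-aware NOT IN sub-query is no longer nested under an OR.
spark.sql("""
  SELECT * FROM users u WHERE 1=0
  UNION
  SELECT * FROM users u WHERE u.country NOT IN (SELECT id FROM countries)
""").show
```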


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-30 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r206047653
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala ---
@@ -56,9 +57,14 @@ class FailureSafeParser[IN](
     }
   }
 
+  private val skipParsing = optimizeEmptySchema && schema.isEmpty
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      if (skipParsing) {
+        Iterator.single(InternalRow.empty)
--- End diff --

ohh yes my bad!
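
For other readers, my understanding of why `Iterator.single(InternalRow.empty)` is the right call here (an assumption about the PR's intent, sketched outside Spark): `parse` runs once per raw record, so the fast path still has to emit one empty row per record or `count()` would return 0.

```scala
// Toy illustration, not Spark code: one placeholder row per record keeps the count,
// while Iterator.empty would drop every record.
val records = Seq("""{"a":1}""", """{"a":2}""", """{"a":3}""")
val fastPath  = records.iterator.flatMap(_ => Iterator.single(()))  // one unit per record
val wrongPath = records.iterator.flatMap(_ => Iterator.empty)       // no rows at all
assert(fastPath.size == 3)
assert(wrongPath.isEmpty)
```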


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-30 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r206045407
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -450,7 +450,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
         input => rawParser.parse(input, createParser, UTF8String.fromString),
         parsedOptions.parseMode,
         schema,
-        parsedOptions.columnNameOfCorruptRecord)
+        parsedOptions.columnNameOfCorruptRecord,
+        optimizeEmptySchema = true)
--- End diff --

No, no, I'm just wondering: since you made it a parameter that can be turned on and off, what would be the case for turning it off?

If there is none, shouldn't we just get rid of the parameter altogether?


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-29 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r205974316
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2233,7 +2233,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
         .option("multiline", "true")
         .options(Map("encoding" -> "UTF-16BE"))
         .json(testFile(fileName))
-        .count()
+        .collect()
--- End diff --

Just curious, why go from count() to collect() here?
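
My own guess at the reason (not something stated in the thread): with the new empty-schema fast path, `count()` no longer forces each record through the JSON parser, so a test meant to exercise the UTF-16BE decoding needs an action that actually materializes the parsed rows, roughly:

```scala
// Sketch, with a hypothetical file path: collect() still parses every record,
// whereas count() can now return without parsing when no columns are required.
val df = spark.read
  .option("multiline", "true")
  .option("encoding", "UTF-16BE")
  .json("/path/to/utf16be.json")
df.collect()
```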


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-29 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r205974224
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -203,19 +203,11 @@ class UnivocityParser(
     }
   }
 
-  private val doParse = if (requiredSchema.nonEmpty) {
--- End diff --

Are the changes here https://github.com/apache/spark/pull/21909/files#diff-3a4dc120191f7052e5d98db11934bfb5R63 replacing the need for the `requiredSchema.nonEmpty` check?


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-29 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r205974275
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala ---
@@ -56,9 +57,14 @@ class FailureSafeParser[IN](
     }
   }
 
+  private val skipParsing = optimizeEmptySchema && schema.isEmpty
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      if (skipParsing) {
+        Iterator.single(InternalRow.empty)
--- End diff --

nit: `Iterator.empty`


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-07-29 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r205969639
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -450,7 +450,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
         input => rawParser.parse(input, createParser, UTF8String.fromString),
         parsedOptions.parseMode,
         schema,
-        parsedOptions.columnNameOfCorruptRecord)
+        parsedOptions.columnNameOfCorruptRecord,
+        optimizeEmptySchema = true)
--- End diff --

Is the case for turning off `optimizeEmptySchema` multiline JSON?


---




[GitHub] spark pull request #21706: [SPARK-24702] Fix Unable to cast to calendar inte...

2018-07-27 Thread dmateusp
Github user dmateusp closed the pull request at:

https://github.com/apache/spark/pull/21706


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-27 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
hey @HyukjinKwon thanks for coming back to me on this :)

I'll close the PR now, and start a thread later today on the dev mailing 
list


---




[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...

2018-07-16 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21740#discussion_r202614320
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -165,7 +183,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
    */
   @Since("1.1.0")
   def recommendProducts(user: Int, num: Int): Array[Rating] =
--- End diff --

Thanks for the explanation Sean :)


---




[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...

2018-07-15 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21740#discussion_r202542318
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -165,7 +183,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
    */
   @Since("1.1.0")
   def recommendProducts(user: Int, num: Int): Array[Rating] =
--- End diff --

I'm new to the code base; I understand `Either` and `Option` aren't used a lot in public APIs in Spark, but shouldn't it be annotated that the function throws a certain type of exception (being more explicit about the exception)?

Something like `@throws(classOf[IllegalArgumentException])`?
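
For illustration, on a made-up method rather than the actual Spark one:

```scala
// Hypothetical example of documenting the thrown exception with @throws
class Recommender(knownUsers: Set[Int]) {
  @throws(classOf[IllegalArgumentException])
  def recommendProducts(user: Int, num: Int): Seq[Int] = {
    require(knownUsers.contains(user), s"userId $user not found in the model")
    Seq.fill(num)(user)  // placeholder body
  }
}
```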


---




[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion

2018-07-15 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21764#discussion_r202539843
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizerRuleExclusionSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf.OPTIMIZER_EXCLUDED_RULES
+
+
+class OptimizerRuleExclusionSuite extends PlanTest {
--- End diff --

Any test case for when a required rule is passed as a "to be excluded" rule?


---




[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion

2018-07-15 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21764#discussion_r202539342
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -175,6 +179,35 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
    * Override to provide additional rules for the operator optimization batch.
    */
   def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil
+
+  override def batches: Seq[Batch] = {
+    val excludedRules =
+      SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(!_.isEmpty))
+    val filteredOptimizationBatches = if (excludedRules.isEmpty) {
+      optimizationBatches
+    } else {
+      optimizationBatches.flatMap { batch =>
+        val filteredRules =
+          batch.rules.filter { rule =>
+            val exclude = excludedRules.contains(rule.ruleName)
+            if (exclude) {
+              logInfo(s"Optimization rule '${rule.ruleName}' is excluded from the optimizer.")
+            }
+            !exclude
+          }
+        if (batch.rules == filteredRules) {
--- End diff --

My understanding is that it is written that way to allow for logging


---




[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion

2018-07-15 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21764#discussion_r202539784
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -127,6 +127,14 @@ object SQLConf {
     }
   }
 
+  val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
+    .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
+      "specified by their rule names and separated by comma. It is not guaranteed that all the " +
+      "rules in this configuration will eventually be excluded, as some rules are necessary " +
--- End diff --

I don't understand the optimizer at a low level (I'd be one of those users for whom it is a black box). Do you think it would be feasible to enumerate the rules that cannot be excluded? Maybe even log a WARNING when validating the config parameter if it contains required rules.
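
For example, if a user set something like this (rule names here are just plausible examples, not a claim about which rules can actually be excluded), a warning listing whatever could not be excluded would make the behavior much less surprising:

```scala
// Hypothetical usage of the proposed config; rule names are examples only
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding," +
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
```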


---




[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion

2018-07-15 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21764#discussion_r202538924
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -46,7 +47,23 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
 
   protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations)
 
-  def batches: Seq[Batch] = {
+  protected def postAnalysisBatches: Seq[Batch] = {
+    Batch("Eliminate Distinct", Once, EliminateDistinct) ::
+    // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
+    // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
+    // However, because we also use the analyzer to canonicalized queries (for view definition),
--- End diff --

"to canonicalized" -> "to canonicalize" ?


---




[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric

2018-07-15 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21766
  
Just checked out the PR,

```scala
scala> spark.sql("SELECT CAST(1 as NUMERIC)")
res0: org.apache.spark.sql.DataFrame = [CAST(1 AS DECIMAL(10,0)): decimal(10,0)]

scala> spark.sql("SELECT NUMERIC(1)")
org.apache.spark.sql.AnalysisException: Undefined function: 'NUMERIC'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

I imagine some tests could be added here:
  - 
`sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DataTypeParserSuite.scala`
  - `sql/core/src/test/resources/sql-tests/inputs/`

Do you think it's worth having a separate DataType or just have it as an 
alias?
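
To illustrate what I mean by the alias option, a rough sketch (not the actual Spark parser code):

```scala
import org.apache.spark.sql.types.{DataType, DecimalType}

// Sketch of the "alias" approach: map the NUMERIC keyword onto the existing
// DecimalType instead of introducing a new DataType.
def primitiveType(name: String, params: Seq[Int]): DataType =
  (name.toLowerCase, params) match {
    case ("decimal" | "numeric", Nil)                   => DecimalType.USER_DEFAULT
    case ("decimal" | "numeric", Seq(precision))        => DecimalType(precision, 0)
    case ("decimal" | "numeric", Seq(precision, scale)) => DecimalType(precision, scale)
    case (other, _) => throw new IllegalArgumentException(s"DataType $other is not supported.")
  }
```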


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-15 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
hey @gatorsmile, sorry to bother, could you just clarify the above?


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-10 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
In the current Spark version I can run

```scala
scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
+----------------+
|               a|
+----------------+
|interval 1 hours|
+----------------+
```
```

while `spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()` 
throws an exception

I am not sure how this PR changes anything with respect to exposing it as an external data type; it just makes the behavior consistent between raw SQL and the DataFrame API.

Is the plan to remove `CalendarIntervalType` completely?


---




[GitHub] spark pull request #21706: [SPARK-24702] Fix Unable to cast to calendar inte...

2018-07-09 Thread dmateusp
Github user dmateusp commented on a diff in the pull request:

https://github.com/apache/spark/pull/21706#discussion_r200959306
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/cast.sql ---
@@ -42,4 +42,23 @@ SELECT CAST('9223372036854775808' AS long);
 
 DESC FUNCTION boolean;
 DESC FUNCTION EXTENDED boolean;
+
--- End diff --

@kiszk fair enough :) I've added test cases for intervals in the wrong order, additions/subtractions of intervals, repeated date keywords, embedded spaces, etc.
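
Roughly the kind of strings I mean (illustrative only; the exact SQL added in the PR may differ):

```scala
// Illustrative examples of the categories above, not the exact tests added
spark.sql("SELECT CAST('interval 3 hours 1 day' AS calendarinterval)")  // parts in the "wrong" order
spark.sql("SELECT CAST('interval 1 day 2 days' AS calendarinterval)")   // repeated date keyword
spark.sql("SELECT CAST('interval  2   weeks' AS calendarinterval)")     // embedded spaces
```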

let me know!


---




[GitHub] spark issue #10130: SPARK-12105 - SPARK-SQL add convenient show functions

2018-07-06 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/10130
  
+1, any chance we can revive this PR?


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-05 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
Could someone review?


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-04 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
@maropu thanks! I got help on the dev email list as well, I've added 
sql-tests now


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-03 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
Just added it to the FunctionRegistry:
```scala
scala> spark.sql("DESC function calendarinterval").show(truncate=false)

+---------------------------------------------------------------------------------------------------+
|function_desc                                                                                        |
+---------------------------------------------------------------------------------------------------+
|Function: calendarinterval                                                                           |
|Class: org.apache.spark.sql.catalyst.expressions.Cast                                                |
|Usage: calendarinterval(expr) - Casts the value `expr` to the target data type `calendarinterval`.   |
+---------------------------------------------------------------------------------------------------+


scala> spark.sql("select calendarinterval('interval 10 days')").show()
+-------------------------------------------+
|CAST(interval 10 days AS CALENDARINTERVAL) |
+-------------------------------------------+
|                        interval 1 weeks ...|
+-------------------------------------------+
```

I sent an email to the dev mailing list because I'm having trouble understanding how the sql-tests work (the structure, and how to run them). I'll add tests in there as soon as I figure that out :)


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-03 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
Realized I need to add `calendarinterval` as a function as well, to reproduce the behavior of `int`, `date`, `string`, etc.

example:
```scala
scala> spark.sql("select string(10)").show()
+------------------+
|CAST(10 AS STRING)|
+------------------+
|                10|
+------------------+
```


---




[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...

2018-07-03 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/21706
  
sure!


---




[GitHub] spark pull request #21706: [SPARK-24702] Fix Unable to cast to calendar inte...

2018-07-03 Thread dmateusp
GitHub user dmateusp opened a pull request:

https://github.com/apache/spark/pull/21706

[SPARK-24702] Fix Unable to cast to calendar interval in spark sql

## What changes were proposed in this pull request?

Making the `calendarinterval` a parse-able DataType keyword to allow for 
casting use-cases in SQL

## How was this patch tested?

Added a parser test in `sql.catalyst.parser.DataTypeParserSuite`

Before:
```
DataType calendarinterval is not supported.(line 1, pos 48)

== SQL ==
select cast(cast(interval '1' day as string) as calendarinterval)
```

After:
```scala
scala> spark.sql("select cast(cast(interval '1' day as string) as calendarinterval)")
res0: org.apache.spark.sql.DataFrame = [CAST(CAST(interval 1 days AS STRING) AS CALENDARINTERVAL): calendarinterval]
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dmateusp/spark SPARK-24702_calendar_interval

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21706.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21706


commit 26c7dd183b41c0fd7da44433afa1be3b68ac5ab1
Author: Daniel Pires 
Date:   2018-07-03T14:25:03Z

Adding CalendarInterval in sql.catalyst.parser; it is a supported data 
type, but wasn't parseable hence casting to an interval was throwing a 
ParseException




---
