[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14116
  
Thank you so much, @gatorsmile. And sorry for the late response; I definitely
have many things to do. Now it's my turn, so let's see how much I can handle. :)





[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14235
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14235
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62448/
Test FAILed.





[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14235
  
**[Test build #62450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62450/consoleFull)** for PR 14235 at commit [`a9a1b00`](https://github.com/apache/spark/commit/a9a1b00411fa6175dd1e628a3485a4313999b636).





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14098
  
Please be careful with case sensitivity. It broke the release candidate 
last time.






[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14235
  
**[Test build #62448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62448/consoleFull)** for PR 14235 at commit [`38f52ce`](https://github.com/apache/spark/commit/38f52cea6fbd53cf66429d396663708afcb1f193).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14179: [SPARK-16055][SPARKR] warning added while using sparkPac...

2016-07-17 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14179
  
LGTM.

btw, for future reference, you could use either
```
# sep defaults to " "
paste("sparkPackages has no effect when using spark-submit or sparkR shell,",
      "please use the --packages commandline instead")
```
or
```
paste0("sparkPackages has no effect when using spark-submit or sparkR shell, ",
       "please use the --packages commandline instead")
```
(`paste` inserts `sep`, a single space by default, between its arguments, while `paste0` concatenates with no separator, hence the explicit trailing space in the second variant.)






[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable token mana...

2016-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r71104553
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -390,8 +393,22 @@ private[spark] class Client(
     // Upload Spark and the application JAR to the remote file system if necessary,
     // and add them as local resources to the application master.
     val fs = destDir.getFileSystem(hadoopConf)
-    val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + destDir
-    YarnSparkHadoopUtil.get.obtainTokensForNamenodes(nns, hadoopConf, credentials)
+
+    // Merge credentials obtained from registered providers
+    var nearestTimeOfNextRenewal = Long.MaxValue
+    credentialManager.obtainCredentials(hadoopConf, credentials).foreach { t =>
--- End diff --

Yes, in the current implementation of `HDFSCredentialProvider`, the time of the
next token renewal will only be calculated when a principal is provided
(i.e., `loginFromKeytab` is true).





[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

2016-07-17 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14207
  
Does it mean that if users do not issue a refresh when the table location
changes, the schema will be wrong when Spark is restarted?





[GitHub] spark issue #14243: [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include j...

2016-07-17 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14243
  
Looks good; interesting approach. This won't run on CRAN since it is part of
the Scala tests?





[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14244
  
cc @hvanhovell Is this test included in the other suite? 





[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14245
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14245
  
**[Test build #62449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62449/consoleFull)** for PR 14245 at commit [`e797c30`](https://github.com/apache/spark/commit/e797c30b94fb191b844ab7ef247a84baf5260379).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14245
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62449/
Test PASSed.





[GitHub] spark pull request #14243: [SPARK-10683][SPARK-16510][SPARKR] Move SparkR in...

2016-07-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14243#discussion_r71104052
  
--- Diff: R/pkg/inst/tests/testthat/jarTest.R ---
@@ -16,17 +16,17 @@
 #
 library(SparkR)
 
-sparkR.session()
+sc <- sparkR.session()
--- End diff --

Probably left behind from a merge; `sc` was not referenced and had been removed.





[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14207
  
@viirya The problem it tries to resolve comes from @rxin's comment in another
PR: https://github.com/apache/spark/pull/14148#issuecomment-232273833





[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14116
  
Finished my first pass. My major concern is that the handling of
`INFORMATION_SCHEMA` does not look clean to me; it seems hacky, and it causes
many holes. More test cases are needed.





[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-17 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/14098
  
@wangmiao1981 I guess it's not ready yet. You may put a `[WIP]` tag in the PR
title while it's a work in progress and remove it when it is ready for review.





[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...

2016-07-17 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14098#discussion_r71103329
  
--- Diff: docs/sql-programming-guide.md ---
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
 
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/SparkSQLExample.py %}
--- End diff --

And the file path is wrong. The actual path is `python/sql/SparkSqlExample.py`.





[GitHub] spark issue #14150: [SPARK-16494] [ML] Upgrade breeze version to 0.12

2016-07-17 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14150
  
@srowen I found no obvious compatibility issues after reading the release
notes. If this looks good, please let it in, since
[SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181) depends on it.





[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71103250
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/systemcatalog/InformationSchema.scala ---
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.systemcatalog
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.SessionCatalog.{DEFAULT_DATABASE, INFORMATION_SCHEMA_DATABASE}
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.CatalogImpl
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+/**
+ * INFORMATION_SCHEMA is a database consisting of views which provide information about all of
+ * the tables, views, and columns in a database.
+ *
+ * These views are designed to be populated by this package in order to be independent from the
+ * Spark catalog. To keep dependencies minimal, INFORMATION_SCHEMA views are currently
+ * implemented as Spark temporary views with a database prefix:
+ * `SessionCatalog.INFORMATION_SCHEMA_DATABASE`.
+ *
+ * The following is the class hierarchy in this package, rooted at
+ * InformationSchemaRelationProvider:
+ *
+ * InformationSchemaRelationProvider
+ *  -> DatabasesRelationProvider
+ *  -> TablesRelationProvider
+ *  -> ViewsRelationProvider
+ *  -> ColumnsRelationProvider
+ *  -> SessionVariablesRelationProvider
+ */
+
+/**
+ * The InformationSchema object provides bootstrap and utility functions.
+ */
+object InformationSchema {
+
+  /**
+   * Register the INFORMATION_SCHEMA database. SessionCatalog.catalog invokes this function.
+   */
+  def registerInformationSchema(sparkSession: SparkSession): Unit = {
+    sparkSession.sql(s"CREATE DATABASE IF NOT EXISTS $INFORMATION_SCHEMA_DATABASE")
+    registerView(sparkSession, new DatabasesRelationProvider, Seq("schemata", "databases"))
+    registerView(sparkSession, new TablesRelationProvider, Seq("tables"))
+    registerView(sparkSession, new ViewsRelationProvider, Seq("views"))
+    registerView(sparkSession, new ColumnsRelationProvider, Seq("columns"))
+    registerView(sparkSession, new SessionVariablesRelationProvider, Seq("session_variables"))
+  }
+
+  /**
+   * Register an INFORMATION_SCHEMA relation provider as a temporary view in the Spark catalog.
+   */
+  private def registerView(
+      sparkSession: SparkSession,
+      relationProvider: SchemaRelationProvider,
+      names: Seq[String]) {
+    val plan =
+      LogicalRelation(relationProvider.createRelation(sparkSession.sqlContext, null, null)).analyze
+    val projectList = plan.output.zip(plan.schema).map {
+      case (attr, col) => Alias(attr, col.name)()
+    }
+    sparkSession.sessionState.executePlan(Project(projectList, plan))
+
+    for (name <- names) {
+      sparkSession.sessionState.catalog.createTempView(s"$INFORMATION_SCHEMA_DATABASE.$name",
+        plan, overrideIfExists = true)
+    }
+  }
+
+  /**
+   * Only EqualTo filters are handled in INFORMATION_SCHEMA data sources.
+   */
+  def unhandledFilters(filters: Array[Filter], columnName: String): Array[Filter] = {
+    import org.apache.spark.sql.sources.EqualTo
+    filters.filter {
+      case EqualTo(attribute, _) if attribute.equalsIgnoreCase(columnName) => false
+      case _ => true
+    }
+  }
+
+  /**
+   * Return an `EqualTo`-filtered DataFrame.
+   */
+  def getFilteredTables(sparkSession: SparkSession, filters: Seq[Expression], columnName: String)
+      : DataFrame = {
+import org.apache.sp

[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-17 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/14098
  
@wangmiao1981 Is this ready for review now? Also, please update the PR title to:

```
[SPARK-16380][SQL][EXAMPLE] Update SQL examples and programming guide for Python language binding
```





[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...

2016-07-17 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14098#discussion_r71103044
  
--- Diff: docs/sql-programming-guide.md ---
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
 
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/SparkSQLExample.py %}
--- End diff --

The original file name `sql.py` should be fine. Actually, the new name
`SparkSQLExample.py` doesn't conform to the Python naming convention. (You may
check other file names under `examples/src/main/python`.)





[GitHub] spark pull request #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite ...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71103029
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience:
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

While making that change, I found this becomes much more convenient. Thank you.





[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

2016-07-17 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14207
  
I think it is not clear what problem this PR tries to solve. It just says it
proposes to save the inferred schema in the external catalog.





[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update

2016-07-17 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/14245
  
LGTM. Can we reuse an existing JIRA number?





[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14245
  
**[Test build #62449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62449/consoleFull)** for PR 14245 at commit [`e797c30`](https://github.com/apache/spark/commit/e797c30b94fb191b844ab7ef247a84baf5260379).





[GitHub] spark pull request #14169: [SPARK-16515][SQL]set default record reader and w...

2016-07-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14169#discussion_r71102534
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -1340,10 +1340,17 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
         val name = conf.getConfString("hive.script.serde",
           "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
         val props = Seq("field.delim" -> "\t")
-        val recordHandler = Try(conf.getConfString(configKey)).toOption
+        val recordHandler = defaultRecordHandler(configKey)
         (Nil, Option(name), props, recordHandler)
     }
 
+    def defaultRecordHandler(configKey: String): Option[String] = {
+      Try(conf.getConfString(configKey)).orElse(Try(configKey match {
+        case "hive.script.recordreader" => "org.apache.hadoop.hive.ql.exec.TextRecordReader"
+        case "hive.script.recordwriter" => "org.apache.hadoop.hive.ql.exec.TextRecordWriter"
+      })).toOption
+    }
+
     val (inFormat, inSerdeClass, inSerdeProps, reader) =
       format(inRowFormat, "hive.script.recordreader")
--- End diff --

Can we pass in the default value for the reader/writer? Like `format(inRowFormat, "hive.script.recordreader", "org.apache.hadoop.hive.ql.exec.TextRecordReader")` and `format(outRowFormat, "hive.script.recordwriter", "org.apache.hadoop.hive.ql.exec.TextRecordWriter")`. Then, in `def format`, we can just use `getConfString(key: String, defaultValue: String)`.
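For illustration, here is a minimal, self-contained sketch of that suggestion, with a plain `Map` standing in for `SQLConf` (the object and method names are invented for this example, not Spark's actual API):

```scala
object RecordHandlerDefaults {
  // Stand-in for SQLConf: in Spark this would be conf.getConfString(key, defaultValue).
  private val conf = Map.empty[String, String]
  private def getConfString(key: String, defaultValue: String): String =
    conf.getOrElse(key, defaultValue)

  // The default class name is passed in by the caller, so no match on the
  // config key is needed inside the helper.
  def recordHandler(configKey: String, defaultClass: String): Option[String] =
    Some(getConfString(configKey, defaultClass))

  def main(args: Array[String]): Unit = {
    // Falls back to the Hive default because the key is unset in the stand-in conf.
    println(recordHandler(
      "hive.script.recordreader",
      "org.apache.hadoop.hive.ql.exec.TextRecordReader"))
    // prints: Some(org.apache.hadoop.hive.ql.exec.TextRecordReader)
  }
}
```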





[GitHub] spark pull request #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example updat...

2016-07-17 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/14245

[MINOR][DOCS][EXAMPLES] Minor Scala example update

## What changes were proposed in this pull request?

This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`.

## How was this patch tested?

Manually verified the generated HTML page.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark minor-scala-example-update

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14245


commit e797c30b94fb191b844ab7ef247a84baf5260379
Author: Cheng Lian 
Date:   2016-07-18T06:00:33Z

Minor Scala example update







[GitHub] spark pull request #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite ...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71102449
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience:
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

Yep.





[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71102423
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/systemcatalog/InformationSchema.scala ---
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.systemcatalog
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.SessionCatalog.{DEFAULT_DATABASE, INFORMATION_SCHEMA_DATABASE}
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.CatalogImpl
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+/**
+ * INFORMATION_SCHEMA is a database consisting of views which provide information about all of
+ * the tables, views, and columns in a database.
+ *
+ * These views are designed to be populated by this package in order to be independent from the
+ * Spark catalog. To keep dependencies minimal, INFORMATION_SCHEMA views are currently
+ * implemented as Spark temporary views with a database prefix:
+ * `SessionCatalog.INFORMATION_SCHEMA_DATABASE`.
+ *
+ * The following is the class hierarchy in this package, rooted at
+ * InformationSchemaRelationProvider:
+ *
+ * InformationSchemaRelationProvider
+ *  -> DatabasesRelationProvider
+ *  -> TablesRelationProvider
+ *  -> ViewsRelationProvider
+ *  -> ColumnsRelationProvider
+ *  -> SessionVariablesRelationProvider
+ */
+
+/**
+ * The InformationSchema object provides bootstrap and utility functions.
+ */
+object InformationSchema {
+
+  /**
+   * Register the INFORMATION_SCHEMA database. SessionCatalog.catalog invokes this function.
+   */
+  def registerInformationSchema(sparkSession: SparkSession): Unit = {
+    sparkSession.sql(s"CREATE DATABASE IF NOT EXISTS $INFORMATION_SCHEMA_DATABASE")
+    registerView(sparkSession, new DatabasesRelationProvider, Seq("schemata", "databases"))
+    registerView(sparkSession, new TablesRelationProvider, Seq("tables"))
+    registerView(sparkSession, new ViewsRelationProvider, Seq("views"))
+    registerView(sparkSession, new ColumnsRelationProvider, Seq("columns"))
+    registerView(sparkSession, new SessionVariablesRelationProvider, Seq("session_variables"))
+  }
+
+  /**
+   * Register an INFORMATION_SCHEMA relation provider as a temporary view in the Spark catalog.
+   */
+  private def registerView(
+      sparkSession: SparkSession,
+      relationProvider: SchemaRelationProvider,
+      names: Seq[String]) {
+    val plan =
+      LogicalRelation(relationProvider.createRelation(sparkSession.sqlContext, null, null)).analyze
--- End diff --

`null, null` -> `parameters = null, schema = null`





[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable token mana...

2016-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r71102254
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/security/ConfigurableCredentialManager.scala ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.security
+
+import java.util.ServiceLoader
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.Credentials
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * A ConfigurableCredentialManager manages all the registered credential providers and offers
+ * APIs for other modules to obtain credentials as well as renewal times. By default,
+ * [[HDFSCredentialProvider]], [[HiveCredentialProvider]] and [[HBaseCredentialProvider]] will
+ * be loaded; any plugged-in credential provider that wants to be managed by
+ * ConfigurableCredentialManager needs to implement the [[ServiceCredentialProvider]] interface
+ * and be put into the resources to be loaded by the ServiceLoader.
+ *
+ * Each specific credential provider is also controlled by
+ * spark.yarn.security.credentials.{service}.enabled; it will not be loaded if set to false.
+ */
+final class ConfigurableCredentialManager private[yarn] (sparkConf: SparkConf) extends Logging {
+  private val deprecatedProviderEnabledConfig = "spark.yarn.security.tokens.%s.enabled"
+  private val providerEnabledConfig = "spark.yarn.security.credentials.%s.enabled"
+
+  // Maintain all the registered credential providers
+  private val credentialProviders = mutable.HashMap[String, ServiceCredentialProvider]()
+
+  // Default credential providers that will be loaded automatically, unless specifically disabled.
+  private val defaultCredentialProviders = Map(
--- End diff --

Sorry, my fault. It is not necessary now; I will remove it.





[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...

2016-07-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14235
  
Can you also remove "[TEST]" from the title? TEST isn't a module.






[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable token mana...

2016-07-17 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r71102195
  
--- Diff: dev/.rat-excludes ---
@@ -99,3 +99,4 @@ spark-deps-.*
 .*tsv
 org.apache.spark.scheduler.ExternalClusterManager
 .*\.sql
+org.apache.spark.deploy.yarn.security.ServiceCredentialProvider
--- End diff --

I followed the convention of other services like `DataSource`, which also need
to be excluded from the RAT check.





[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71102174
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience:
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

Also, instead of using a variable as a flag, maybe we can use an environment variable, and then document at the beginning:

```
* To re-generate golden files, run: SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
```

something like this?
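For what it's worth, a tiny sketch of reading such a flag (the environment variable name is from the comment above; the surrounding object is illustrative):

```scala
object GoldenFileFlag {
  // True only when the variable is explicitly set to "1", e.g.
  //   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
  val regenerateGoldenFiles: Boolean =
    sys.env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  def main(args: Array[String]): Unit =
    println(s"regenerate golden files: $regenerateGoldenFiles")
}
```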






[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71102142
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/systemcatalog/InformationSchema.scala ---
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.systemcatalog
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.SessionCatalog.{DEFAULT_DATABASE, INFORMATION_SCHEMA_DATABASE}
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.CatalogImpl
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+/**
+ * INFORMATION_SCHEMA is a database consisting of views which provide information about all of
+ * the tables, views, and columns in a database.
+ *
+ * These views are designed to be populated by this package in order to be independent from the
+ * Spark catalog. To keep dependencies minimal, INFORMATION_SCHEMA views are currently
+ * implemented as Spark temporary views with a database prefix:
+ * `SessionCatalog.INFORMATION_SCHEMA_DATABASE`.
+ *
+ * The following is the class hierarchy in this package, rooted at
+ * InformationSchemaRelationProvider:
+ *
+ * InformationSchemaRelationProvider
+ *  -> DatabasesRelationProvider
+ *  -> TablesRelationProvider
+ *  -> ViewsRelationProvider
+ *  -> ColumnsRelationProvider
+ *  -> SessionVariablesRelationProvider
+ */
+
+/**
+ * The InformationSchema object provides bootstrap and utility functions.
+ */
+object InformationSchema {
+
+  /**
+   * Register the INFORMATION_SCHEMA database. SessionCatalog.catalog invokes this function.
+   */
+  def registerInformationSchema(sparkSession: SparkSession): Unit = {
+    sparkSession.sql(s"CREATE DATABASE IF NOT EXISTS $INFORMATION_SCHEMA_DATABASE")
--- End diff --

I am lost. The `INFORMATION_SCHEMA_DATABASE` database is created through a SQL
command, but the function `databaseExists` always returns true for
`INFORMATION_SCHEMA_DATABASE`.





[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71102107
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience:
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

Ah, I see. My worry is that Maven and SBT have different base paths where the
tests run, and the test will fail in Maven.

How about we define the output directory as a variable at the beginning of the
suite, and then use it when we write output? For reading input we simply use
`sqlgen/$answerFile.sql`.
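A rough sketch of that write/read split, assuming the fixed base directory only matters when regenerating (names and paths here are illustrative, not the suite's final code):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object GoldenFiles {
  // Fixed source-tree directory, used only when (re)generating golden files.
  private val goldenFileBaseDir = "src/test/resources/sqlgen"

  def writeAnswer(name: String, sql: String): Unit =
    Files.write(
      Paths.get(goldenFileBaseDir, s"$name.sql"),
      sql.getBytes(StandardCharsets.UTF_8))

  // Reads resolve through the classloader, so they work under both Maven and SBT.
  def readAnswer(name: String): String = {
    val url = getClass.getClassLoader.getResource(s"sqlgen/$name.sql")
    new String(Files.readAllBytes(Paths.get(url.toURI)), StandardCharsets.UTF_8)
  }
}
```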





[GitHub] spark pull request #14236: [SPARK-16588][SQL] Missed API fix for a function ...

2016-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14236





[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14235
  
Now, the remaining issue is using `getResource` to save the golden files. I 
left a comment about that.





[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14235
  
**[Test build #62448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62448/consoleFull)** for PR 14235 at commit [`38f52ce`](https://github.com/apache/spark/commit/38f52cea6fbd53cf66429d396663708afcb1f193).





[GitHub] spark issue #14236: [SPARK-16588][SQL] Missed API fix for a function name mi...

2016-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14236
  
@rxin Sure, thank you very much.





[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14116
  
```Scala
val catalog = spark.sessionState.catalog

catalog.setCurrentDatabase(SessionCatalog.INFORMATION_SCHEMA_DATABASE)
sql("CREATE TABLE my_tab (age INT, name STRING)")
```

We can set the current database to `INFORMATION_SCHEMA_DATABASE`, but we get an
error when trying to create a table in this database.
```
Database 'information_schema' not found;
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
'information_schema' not found;
```







[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101786
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience:
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

For this one, here is the code and the result in IntelliJ:
```
val path = s"src/test/resources/sqlgen/$answerFile.sql"
val path2 = getClass.getClassLoader.getResource("sqlgen/README.md")
```
```
file:/Users/dongjoon/SPARK-CHECK-GEN-SQL/sql/hive/target/scala-2.11/test-classes/sqlgen/README.md
```
Reading resource files is not a problem, but we need the `src` directory for
saving new golden SQL files.





[GitHub] spark issue #14236: [SPARK-16588][SQL] Missed API fix for a function name mi...

2016-07-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14236
  
Since it is a small change I will just push a commit myself. Thanks for 
noticing it.



[GitHub] spark pull request #14236: [SPARK-16588][SQL] Missed API fix for a function ...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14236#discussion_r71101740
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -977,7 +977,10 @@ object functions {
*
* @group normal_funcs
* @since 1.4.0
+   * @deprecated As of 2.0.0, replaced by `monotonically_increasing_id`. 
This will be removed in
+   *Spark 2.1.
*/
+  @deprecated("Use monotonically_increasing_id. This will be removed in 
Spark 2.1.", "2.0.0")
--- End diff --

We probably don't want to remove it in 2.1. Can you just say "Use 
monotonically_increasing_id()"?
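
Applied to the diff above, the suggestion would read something like this (a 
sketch inside `functions.scala`; the committed wording may differ):

```scala
// Sketch of the suggested rewording: deprecate without promising removal.
@deprecated("Use monotonically_increasing_id()", "2.0.0")
def monotonicallyIncreasingId(): Column = monotonically_increasing_id()
```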


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101535
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
+if (answerFile != null) {
+  val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

I'll try again.


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101452
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
+if (answerFile != null) {
+  val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

What I mean is that `sqlgen` is currently `sql/hive/src/test/resources/sqlgen`. 
So `sql/hive/src/test/resources/sqlgen/README.md` is also copied into 
`test-classes` like the other resources. Did I miss something?


[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71101399
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -484,8 +521,11 @@ class SessionCatalog(
 val dbTables =
   externalCatalog.listTables(dbName, pattern).map { t => 
TableIdentifier(t, Some(dbName)) }
 synchronized {
-  val _tempTables = StringUtils.filterPattern(tempTables.keys.toSeq, 
pattern)
+  var _tempTables = StringUtils.filterPattern(tempTables.keys.toSeq, 
pattern)
 .map { t => TableIdentifier(t) }
+  if (db != INFORMATION_SCHEMA_DATABASE) {
--- End diff --

This part is not clear to me. 
If users want to list tables in the `INFORMATION_SCHEMA_DATABASE` database, 
will it return an empty result?


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101198
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
|# Converted SQL query string:
|$convertedSQL
|
-   |# Original HiveQL query string:
-   |$hiveQl
+   |# Original SQL query string:
+   |$sqlString
|
|# Resolved query plan:
|${df.queryExecution.analyzed.treeString}
  """.stripMargin, cause)
 }
   }
 
+  test("checkSQL") {
--- End diff --

Oh, I see. I'll fix like that.


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101205
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
+if (answerFile != null) {
+  val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

What do you mean? What I meant is to create a README.md file in the sqlgen 
directory; at runtime we can then get the location of the sqlgen directory via 
getResource("sqlgen/README.md").
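
A sketch of that anchor-file trick (the README.md name is from this 
discussion; null checks and error handling are omitted):

```scala
// Locate the sqlgen resource directory at runtime by resolving a known file
// inside it, then enumerate the golden files next to it.
import java.io.File

val anchor = getClass.getClassLoader.getResource("sqlgen/README.md")
val sqlgenDir = new File(anchor.toURI).getParentFile
val goldenFiles = sqlgenDir.listFiles.filter(_.getName.endsWith(".sql"))
```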


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101173
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +88,58 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
+if (answerFile != null) {
+  val path = s"src/test/resources/sqlgen/$answerFile.sql"
+  val separator = "-" * 80
+  if (saveQuery) {
+val header = "-- This file is automatically generated by 
LogicalPlanToSQLSuite."
+val answerText = 
s"$header\n${originalSQL.trim()}\n${separator}\n$normalizedGenSQL\n"
+Files.write(Paths.get(path), 
answerText.getBytes(StandardCharsets.UTF_8))
+  } else {
+val answerText = new String(Files.readAllBytes(Paths.get(path)), 
StandardCharsets.UTF_8)
+val sqls = answerText.split(separator)
+if (sqls.length == 2) {
--- End diff --

No problem.


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71101148
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
+if (answerFile != null) {
+  val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

Ur, I think any resource files under `resources` are copied into 
`test-classes` along with the classes.


[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14244
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62447/
Test PASSed.


[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14244
  
Merged build finished. Test PASSed.


[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14244
  
**[Test build #62447 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62447/consoleFull)**
 for PR 14244 at commit 
[`871d358`](https://github.com/apache/spark/commit/871d358b3b24b5b2acf46f5d882f4b8614fdd670).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71101023
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -471,6 +480,34 @@ class SessionCatalog(
   }
 
   /**
+   * Normalize TableIdentifier by consistently ensuring the following two 
rules.
+   * 1. System-side temporary tables should have None as database.
+   * 2. System-side temporary tables should have prefixed table names.
+   * Currently, only INFORMATION_SCHEMA tables are system-side temporary 
tables, and this function
+   * returns TableIdentifier("information_schema.databases", None).
+   */
+  protected def normalizeTableIdentifier(name: TableIdentifier): 
TableIdentifier = synchronized {
+if (name.database.isDefined) {
+  if (name.database.contains(INFORMATION_SCHEMA_DATABASE)) {
+TableIdentifier(s"$INFORMATION_SCHEMA_DATABASE.${name.table}", 
None)
+  } else {
+name
+  }
+} else {
+  val tableName = formatTableName(name.table)
+  if (tableName.startsWith(INFORMATION_SCHEMA_DATABASE + ".")) {
--- End diff --

Can you explain which cases will enter this processing? Is it possible that we 
could hit a backtick-quoted `INFORMATION_SCHEMA_DATABASE` here?
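
For instance, a hypothetical pair of queries that would reach the two branches 
(illustrative only, not from the PR's tests):

```scala
// The unqualified form parses the database part out, so
// normalizeTableIdentifier sees database = Some("information_schema");
// the backtick-quoted form yields a single table name containing a dot,
// which is what the startsWith branch would match.
sql("SELECT * FROM information_schema.databases")
sql("SELECT * FROM `information_schema.databases`")
```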


[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71100937
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -471,6 +480,34 @@ class SessionCatalog(
   }
 
   /**
+   * Normalize TableIdentifier by consistently ensuring the following two 
rules.
+   * 1. System-side temporary tables should have None as database.
+   * 2. System-side temporary tables should have prefixed table names.
+   * Currently, only INFORMATION_SCHEMA tables are system-side temporary 
tables, and this function
+   * returns TableIdentifier("information_schema.databases", None).
--- End diff --

Please correct this description. 


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71100950
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
|# Converted SQL query string:
|$convertedSQL
|
-   |# Original HiveQL query string:
-   |$hiveQl
+   |# Original SQL query string:
+   |$sqlString
|
|# Resolved query plan:
|${df.queryExecution.analyzed.treeString}
  """.stripMargin, cause)
 }
   }
 
+  test("checkSQL") {
--- End diff --

Actually maybe these should be 4 test cases instead.


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71100943
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
|# Converted SQL query string:
|$convertedSQL
|
-   |# Original HiveQL query string:
-   |$hiveQl
+   |# Original SQL query string:
+   |$sqlString
|
|# Resolved query plan:
|${df.queryExecution.analyzed.treeString}
  """.stripMargin, cause)
 }
   }
 
+  test("checkSQL") {
--- End diff --

What I meant about the inline comment was something like this:

```
test("test behavior for the golden file test harness") {
  // 1. Test should fail if the SQL query cannot be parsed.
  val m = intercept[ParseException] {
...
  }
  assert(...)

  // 2. Test should fail if the golden file cannot be found
  ...

}
```


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71100930
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
--- End diff --

Sure, I think so. It would be a great issue.


[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14116
  
General question 1: Should we disallow users from running most DDLs against 
`INFORMATION_SCHEMA`? For example, creating functions, creating tables, ...


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71100872
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
|# Converted SQL query string:
|$convertedSQL
|
-   |# Original HiveQL query string:
-   |$hiveQl
+   |# Original SQL query string:
+   |$sqlString
|
|# Resolved query plan:
|${df.queryExecution.analyzed.treeString}
  """.stripMargin, cause)
 }
   }
 
+  test("checkSQL") {
--- End diff --

Add some inline comments explaining what we are actually testing here. Also, 
maybe rename the test to "test behavior for the golden file test harness".


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71100858
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +88,58 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
+if (answerFile != null) {
+  val path = s"src/test/resources/sqlgen/$answerFile.sql"
+  val separator = "-" * 80
+  if (saveQuery) {
+val header = "-- This file is automatically generated by 
LogicalPlanToSQLSuite."
+val answerText = 
s"$header\n${originalSQL.trim()}\n${separator}\n$normalizedGenSQL\n"
+Files.write(Paths.get(path), 
answerText.getBytes(StandardCharsets.UTF_8))
+  } else {
+val answerText = new String(Files.readAllBytes(Paths.get(path)), 
StandardCharsets.UTF_8)
+val sqls = answerText.split(separator)
+if (sqls.length == 2) {
--- End diff --

Shouldn't we do an assert here instead?
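
Something like this, perhaps (a sketch against the variables in the diff 
above):

```scala
// Fail loudly on a malformed golden file instead of silently skipping
// the comparison when the separator does not split it into two parts.
val sqls = answerText.split(separator)
assert(sqls.length == 2, s"Malformed golden answer file: $path")
```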


[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...

2016-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14235#discussion_r71100846
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
-  private def checkHiveQl(hiveQl: String): Unit = {
-val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the 
convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> 
`gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, 
answerFile: String) = {
+val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", 
"`gen_attr`")
--- End diff --

Can we make gen_attr_YYY stable? (You can submit a separate PR.)
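
For context, the instability the normalization papers over looks roughly like 
this (the ids and query are illustrative):

```scala
// The numeric suffix of gen_attr_N depends on expression id allocation, so
// two runs of the same query can differ; stripping the id keeps the golden
// comparison deterministic.
val run1 = "SELECT `gen_attr_1` AS `id` FROM t"
val run2 = "SELECT `gen_attr_7` AS `id` FROM t"
def normalize(s: String): String = s.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
assert(normalize(run1) == normalize(run2))
```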


[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14116#discussion_r71100512
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -471,6 +480,34 @@ class SessionCatalog(
   }
 
   /**
+   * Normalize TableIdentifier by consistently ensuring the following two 
rules.
+   * 1. System-side temporary tables should have None as database.
--- End diff --

`System-wide`? 


[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...

2016-07-17 Thread deanchen
Github user deanchen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13988#discussion_r71100480
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Converts a sequence of string to CSV string
+ */
+private[csv] object UnivocityGenerator extends Logging {
+  /**
+   * Transforms a single InternalRow to CSV using Univocity
+   *
+   * @param rowSchema the schema object used for conversion
+   * @param writer a CsvWriter object
+   * @param headers headers to write
+   * @param writeHeader true if it needs to write header
+   * @param options CSVOptions object containing options
+   * @param row The row to convert
+   */
+  def apply(
+  rowSchema: StructType,
+  writer: CsvWriter,
+  headers: Array[String],
+  writeHeader: Boolean,
+  options: CSVOptions)(row: InternalRow): Unit = {
+val tokens = {
+  row.toSeq(rowSchema).map { field =>
+// TODO: It seems that not all the data types can be represented by `toString`.
+// For example, `DateType` and `TimestampType` are written as long timestamp values.
--- End diff --

Ah, thanks; I commented on that PR. Glad to see someone showing some love to 
Spark's CSV datasource!


[GitHub] spark issue #13912: [SPARK-16216][SQL] CSV data source supports custom date ...

2016-07-17 Thread deanchen
Github user deanchen commented on the issue:

https://github.com/apache/spark/pull/13912
  
@srowen @rxin Would love to see this get merged, as it has been a pain 
point for us. As an engineer I am not a fan of timezoneless dates, but passing 
through or writing timezoneless dates to CSVs has been a necessary task for a 
variety of reasons in the past.

We use Spark to parse large amounts of daily financial asset data with 
pre-agreed date conventions. Most of the dates we deal with do not belong to a 
timezone and are treated more like indices with a maximum granularity of a 
day. As a result, the CSV reports we produce for our customers never contain a 
timezone, and almost all dates pass through from the original timezoneless 
dates.

This is not just a common pattern with financial data, but also common 
practice in large companies I have worked at. A large tech company I worked at 
in the past used the convention of MST dates and datetimes for all internal 
systems (databases included).
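
For what it's worth, the usage we are hoping for looks roughly like this (a 
sketch assuming the `dateFormat` option this PR proposes; the exact semantics 
are defined by the PR, and the column name and paths are illustrative):

```scala
import org.apache.spark.sql.types._

// Parse and write dates with a caller-supplied pattern instead of a
// timestamp default.
val schema = StructType(Seq(
  StructField("trade_date", DateType),
  StructField("price", DoubleType)))

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .schema(schema)
  .csv("/data/trades.csv")

df.write
  .option("dateFormat", "yyyy-MM-dd")
  .csv("/data/trades_out")
```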




[GitHub] spark issue #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14132
  
I finished my third pass. Will review it again when your new version is 
ready. Thanks!


[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71100213
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -755,4 +755,243 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest 
with SQLTestUtils {
   checkHiveQl("select * from orc_t")
 }
   }
+
+  test("broadcast hint on single table") {
+checkGeneratedSQL(
+  "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0",
+  """
+|SELECT `gen_attr` AS `id`
+|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr`
+|  FROM (SELECT `id` AS `gen_attr`
+|FROM `default`.`parquet_t0`)
+|AS gen_subquery_0)
+|  AS parquet_t0
+|""".stripMargin)
+
+checkGeneratedSQL(
+  "SELECT /*+ MAPJOIN(parquet_t0, parquet_t0) */ * FROM parquet_t0",
+  """
+|SELECT `gen_attr` AS `id`
+|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr`
+|  FROM (SELECT `id` AS `gen_attr`
+|FROM `default`.`parquet_t0`)
+|AS gen_subquery_0)
+|  AS parquet_t0
+|""".stripMargin)
+
+checkGeneratedSQL(
+  "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 as a",
+  """
+|SELECT `gen_attr` AS `id`
+|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr`
+|  FROM ((SELECT `id` AS `gen_attr`
+| FROM `default`.`parquet_t0`)
+| AS gen_subquery_0)
+|AS a)
+|  AS a
+  """.stripMargin)
+
+checkGeneratedSQL(
+  """
+|SELECT /*+ MAPJOIN(parquet_t0) */ *
--- End diff --

This should be `T.parquet_t0`, right?


[GitHub] spark issue #14102: [SPARK-16434][SQL] Avoid per-record type dispatch in JSO...

2016-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14102
  
@yhuai Thank you for your review! I will try to address all your comments 
first.


[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...

2016-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14102#discussion_r71099616
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
 ---
@@ -35,184 +34,306 @@ import org.apache.spark.util.Utils
 
 private[json] class SparkSQLJsonProcessingException(msg: String) extends 
RuntimeException(msg)
 
-object JacksonParser extends Logging {
+private[sql] class JacksonParser(schema: StructType, options: JSONOptions) 
extends Logging {
+  import com.fasterxml.jackson.core.JsonToken._
 
-  def parse(
-  input: RDD[String],
-  schema: StructType,
-  columnNameOfCorruptRecords: String,
-  configOptions: JSONOptions): RDD[InternalRow] = {
+  // A `ValueConverter` is responsible for converting a value from 
`JsonParser`
+  // to a value in a field for `InternalRow`.
+  private type ValueConverter = (JsonParser) => Any
 
-input.mapPartitions { iter =>
-  parseJson(iter, schema, columnNameOfCorruptRecords, configOptions)
+  // `ValueConverter`s for the root schema for all fields in the schema
+  private val rootConverter: ValueConverter = makeRootConverter(schema)
+
+  private val factory = new JsonFactory()
+  options.setJacksonOptions(factory)
+
+  private def failedConversion(
+  parser: JsonParser,
+  dataType: DataType): Any = parser.getCurrentToken match {
+case null | VALUE_NULL =>
+  null
+
+case _ if parser.getTextLength < 1 =>
+  // guard the non string type
+  null
+
+case token =>
+// We cannot parse this token based on the given data type. So, we 
throw a
+// SparkSQLJsonProcessingException and this exception will be caught by
+// parseJson method.
+throw new SparkSQLJsonProcessingException(
+  s"Failed to parse a value for data type $dataType (current token: 
$token).")
+  }
+
+  private def failedRecord(record: String): Seq[InternalRow] = {
+// create a row even if no corrupt record column is present
+if (options.failFast) {
+  throw new RuntimeException(s"Malformed line in FAILFAST mode: 
$record")
+}
+if (options.dropMalformed) {
+  logWarning(s"Dropping malformed line: $record")
+  Nil
+} else {
+  val row = new GenericMutableRow(schema.length)
+  for (corruptIndex <- 
schema.getFieldIndex(options.columnNameOfCorruptRecord)) {
+require(schema(corruptIndex).dataType == StringType)
+row.update(corruptIndex, UTF8String.fromString(record))
+  }
+  Seq(row)
 }
   }
 
   /**
-   * Parse the current token (and related children) according to a desired 
schema
-   * This is a wrapper for the method `convertField()` to handle a row 
wrapped
-   * with an array.
+   * Create a converter which converts the JSON documents held by the 
`JsonParser`
+   * to a value according to a desired schema. This is a wrapper for the 
method
+   * `makeConverter()` to handle a row wrapped with an array.
*/
-  def convertRootField(
-  factory: JsonFactory,
-  parser: JsonParser,
-  schema: DataType): Any = {
-import com.fasterxml.jackson.core.JsonToken._
-(parser.getCurrentToken, schema) match {
-  case (START_ARRAY, st: StructType) =>
-// SPARK-3308: support reading top level JSON arrays and take 
every element
-// in such an array as a row
-convertArray(factory, parser, st)
-
-  case (START_OBJECT, ArrayType(st, _)) =>
-// the business end of SPARK-3308:
-// when an object is found but an array is requested just wrap it 
in a list
-convertField(factory, parser, st) :: Nil
+  def makeRootConverter(dataType: DataType): ValueConverter = dataType 
match {
+case st: StructType =>
+  // SPARK-3308: support reading top level JSON arrays and take every 
element
+  // in such an array as a row
+  val elementConverter = makeConverter(st)
+  val fieldConverters = st.map(_.dataType).map(makeConverter)
+  (parser: JsonParser) => parser.getCurrentToken match {
+case START_OBJECT => convertObject(parser, st, fieldConverters)
+case START_ARRAY => convertArray(parser, elementConverter)
+case _ => failedConversion(parser, st)
+  }
 
-  case _ =>
-convertField(factory, parser, schema)
-}
+case ArrayType(st: StructType, _) =>
+  // the business end of SPARK-3308:
+  // when an object is found but an array is requested just wrap it in 
a list
+  val elementConverter = makeConverter(st)
+  val fieldConverters = st.map(_.dataType

[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14235
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62446/
Test PASSed.


[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14235
  
Merged build finished. Test PASSed.


[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14235
  
**[Test build #62446 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62446/consoleFull)**
 for PR 14235 at commit 
[`e2a7ac4`](https://github.com/apache/spark/commit/e2a7ac445e7f0288d384996dc0e36195e5849c5a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...

2016-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14102#discussion_r71099344
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
 ---
@@ -35,184 +34,306 @@ import org.apache.spark.util.Utils
 
 private[json] class SparkSQLJsonProcessingException(msg: String) extends 
RuntimeException(msg)
 
-object JacksonParser extends Logging {
+private[sql] class JacksonParser(schema: StructType, options: JSONOptions) 
extends Logging {
+  import com.fasterxml.jackson.core.JsonToken._
 
-  def parse(
-  input: RDD[String],
-  schema: StructType,
-  columnNameOfCorruptRecords: String,
-  configOptions: JSONOptions): RDD[InternalRow] = {
+  // A `ValueConverter` is responsible for converting a value from 
`JsonParser`
+  // to a value in a field for `InternalRow`.
+  private type ValueConverter = (JsonParser) => Any
 
-input.mapPartitions { iter =>
-  parseJson(iter, schema, columnNameOfCorruptRecords, configOptions)
+  // `ValueConverter`s for the root schema for all fields in the schema
+  private val rootConverter: ValueConverter = makeRootConverter(schema)
+
+  private val factory = new JsonFactory()
+  options.setJacksonOptions(factory)
+
+  private def failedConversion(
+  parser: JsonParser,
+  dataType: DataType): Any = parser.getCurrentToken match {
+case null | VALUE_NULL =>
+  null
+
+case _ if parser.getTextLength < 1 =>
+  // guard the non string type
+  null
--- End diff --

These two cases are the only ones that do not care about `DataType`; they were 
[here](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L83-L84)
 and 
[here](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L93-L95).
So I put them here because every `ValueConverter` should handle this 
case. I can move them into each `ValueConverter` if this does not look good.
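
To make the intent concrete, the dispatch-once pattern boils down to something 
like this (a heavily simplified sketch; the real converters take a 
`JsonParser`, not a `String`):

```scala
import org.apache.spark.sql.types._

// Resolve each DataType to a converter closure once, when the schema is
// bound, then reuse the closures for every record instead of matching on
// DataType per value.
type ValueConverter = String => Any  // simplified stand-in

def makeConverter(dataType: DataType): ValueConverter = dataType match {
  case IntegerType => (s: String) => s.toInt
  case DoubleType  => (s: String) => s.toDouble
  case StringType  => (s: String) => s
  case other => throw new UnsupportedOperationException(s"No converter for $other")
}

val converters = Seq(IntegerType, StringType).map(makeConverter)
// Per-record work is now just applying the prebuilt closures.
val row = Seq("42", "abc").zip(converters).map { case (v, c) => c(v) }
```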


[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71099153
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the 
given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  case logical: LogicalPlan => logical transformDown {
+case h @ Hint(name, parameters, child)
+if Seq("BROADCAST", "BROADCASTJOIN", 
"MAPJOIN").contains(name.toUpperCase) =>
+  var resolvedChild = child
+
+  for (param <- parameters) {
+val names = param.split("\\.")
+val tid = if (names.length > 1) {
+  TableIdentifier(names(1), Some(names(0)))
+} else {
+  TableIdentifier(param, None)
+}
+try {
+  catalog.lookupRelation(tid)
+
+  var stop = false
+  resolvedChild = resolvedChild.transformDown {
+case r @ BroadcastHint(SubqueryAlias(t, _))
+  if !stop && resolver(t, tid.identifier) =>
+  stop = true
+  r
+case r @ SubqueryAlias(t, _) if !stop && resolver(t, 
tid.identifier) =>
+  stop = true
+  BroadcastHint(r)
--- End diff --

When a `BroadcastHint` crosses a `SubqueryAlias`, we need to remove the 
qualifier.


[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71099090
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends 
Logging {
   }
 }
 
+/**
+ * Merge and move upward to the nearest Project.
+ * A broadcast hint comment is scattered into multiple nodes inside 
the plan, and the
+ * information of BroadcastHint resides its current position inside 
the plan. In order to
+ * reconstruct broadcast hint comment, we need to pack the information 
of BroadcastHint into
+ * Hint("BROADCAST", _, _) and collect them up by moving upward to the 
nearest Project node.
+ */
+object NormalizeBroadcastHint extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan 
transformUp {
+// Capture the broadcasted information and store it in Hint.
+case BroadcastHint(child @ SubqueryAlias(_, Project(_, 
SQLTable(database, table, _, _ =>
+  Hint("BROADCAST", Seq(table), child)
+
+// Nearest Project is found.
+case p @ Project(_, Hint(_, _, _)) => p
+
+// Merge BROADCAST hints up to the nearest Project.
+case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) 
=>
+  h.copy(parameters = params1 ++ params2)
+case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ 
Hint("BROADCAST", p2, right), _, _) =>
+  h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right 
= right))
+
+// Bubble up BROADCAST hints to the nearest Project.
+case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) =>
+  h.copy(child = j.copy(left = hintChild))
+case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) =>
+  h.copy(child = j.copy(right = hintChild))
+case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) =>
--- End diff --

When moving a `Hint` across a `SubqueryAlias`, we should consider 
adding the qualifier back. 


[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...

2016-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14102#discussion_r71099098
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONOptions.scala
 ---
@@ -51,7 +53,8 @@ private[sql] class JSONOptions(
 
parameters.get("allowBackslashEscapingAnyCharacter").map(_.toBoolean).getOrElse(false)
   val compressionCodec = 
parameters.get("compression").map(CompressionCodecs.getCodecClassName)
   private val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
-  val columnNameOfCorruptRecord = 
parameters.get("columnNameOfCorruptRecord")
+  val columnNameOfCorruptRecord =
+parameters.getOrElse("columnNameOfCorruptRecord", 
sqlConf.columnNameOfCorruptRecord)
--- End diff --

Oh, actually, the reason was not only to remove duplicated logic. Another 
reason is that we are passing `columnNameOfCorruptRecords` to 
`InferSchema.infer(...)` and `JacksonParser.parse(...)`:


```scala
def infer(
json: RDD[String],
columnNameOfCorruptRecords: String,
configOptions: JSONOptions): StructType = {
  ...
```

```scala
def parse(
input: RDD[String],
schema: StructType,
columnNameOfCorruptRecords: String,
configOptions: JSONOptions): RDD[InternalRow] = {
  ...
```

even though `columnNameOfCorruptRecords` is an option in `JSONOptions`. I 
would revert this change if it still does not sound good, but I just wanted to 
let you know, just in case.
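
For reference, the user-facing effect is roughly this (a sketch; the per-read 
option should win over `spark.sql.columnNameOfCorruptRecord` when both are 
set, and the column name and path are illustrative):

```scala
import org.apache.spark.sql.types._

// Malformed lines land in the designated corrupt-record column in
// PERMISSIVE mode.
val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("_corrupt", StringType)))

val df = spark.read
  .option("columnNameOfCorruptRecord", "_corrupt")
  .schema(schema)
  .json("/data/input.json")

df.filter(df("_corrupt").isNotNull).show()
```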


[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098985
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging {
   }
 }
 
+/**
+ * Merge and move upward to the nearest Project.
+ * A broadcast hint comment is scattered into multiple nodes inside the plan, and the
+ * information of BroadcastHint resides at its current position inside the plan. In order to
+ * reconstruct the broadcast hint comment, we need to pack the information of BroadcastHint into
+ * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node.
+ */
+object NormalizeBroadcastHint extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // Capture the broadcasted information and store it in Hint.
+    case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _)))) =>
+      Hint("BROADCAST", Seq(table), child)
+
+    // Nearest Project is found.
+    case p @ Project(_, Hint(_, _, _)) => p
+
+    // Merge BROADCAST hints up to the nearest Project.
+    case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) =>
+      h.copy(parameters = params1 ++ params2)
+    case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) =>
+      h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right))
+
+    // Bubble up BROADCAST hints to the nearest Project.
+    case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) =>
+      h.copy(child = j.copy(left = hintChild))
+    case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) =>
+      h.copy(child = j.copy(right = hintChild))
+    case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) =>
+      h.copy(child = s.copy(child = hintChild))
+    case ll @ LocalLimit(_, h @ Hint("BROADCAST", _, hintChild)) =>
+      h.copy(child = ll.copy(child = hintChild))
+    case f @ Filter(_, h @ Hint("BROADCAST", _, hintChild)) =>
+      h.copy(child = f.copy(child = hintChild))
+    case a @ Aggregate(_, _, h @ Hint("BROADCAST", _, hintChild)) =>
+      h.copy(child = a.copy(child = hintChild))
+    case s @ Sort(_, _, h @ Hint("BROADCAST", _, hintChild)) =>
+      h.copy(child = s.copy(child = hintChild))
+    case g @ Generate(_, _, _, _, _, h @ Hint("BROADCAST", _, hintChild)) =>
+      h.copy(child = g.copy(child = hintChild))
+    // Hints are not allowed to cross set operations: UNION/INTERSECT/EXCEPT.
--- End diff --

Is this a restriction? Any reason?





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098966
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging {
   }
 }
 
+/**
+ * Merge and move upward to the nearest Project.
+ * A broadcast hint comment is scattered into multiple nodes inside the plan, and the
+ * information of BroadcastHint resides at its current position inside the plan. In order to
+ * reconstruct the broadcast hint comment, we need to pack the information of BroadcastHint into
+ * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node.
+ */
+object NormalizeBroadcastHint extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // Capture the broadcasted information and store it in Hint.
+    case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _)))) =>
+      Hint("BROADCAST", Seq(table), child)
+
+    // Nearest Project is found.
+    case p @ Project(_, Hint(_, _, _)) => p
+
+    // Merge BROADCAST hints up to the nearest Project.
+    case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) =>
+      h.copy(parameters = params1 ++ params2)
+    case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) =>
+      h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right))
+
+    // Bubble up BROADCAST hints to the nearest Project.
--- End diff --

You do not need to enumerate all the operator types; this can depend on the tree structure instead: `UnaryNode`, `BinaryNode`, and so on.
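
For illustration, a minimal sketch of such a generalization, assuming the `Hint` node from this PR is a `UnaryNode` and relying on Catalyst's `withNewChildren` instead of per-operator `copy` calls; this is a hedged rewrite, not the PR's code:

```scala
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule

object NormalizeBroadcastHintGeneric extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Stop as soon as the nearest Project is reached.
    case p @ Project(_, Hint(_, _, _)) => p
    // Any other unary node sitting on top of a hint: swap the two, so the
    // hint bubbles up one level regardless of the concrete operator type.
    case u: UnaryNode if !u.isInstanceOf[Hint] && u.child.isInstanceOf[Hint] =>
      val h = u.child.asInstanceOf[Hint]
      h.copy(child = u.withNewChildren(h.child :: Nil))
  }
}
```

Binary nodes such as `Join`, and the set operations that should block the hint, would still need explicit cases.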





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098821
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging {
   }
 }
 
+/**
+ * Merge and move upward to the nearest Project.
+ * A broadcast hint comment is scattered into multiple nodes inside the plan, and the
+ * information of BroadcastHint resides at its current position inside the plan. In order to
+ * reconstruct the broadcast hint comment, we need to pack the information of BroadcastHint into
+ * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node.
+ */
+object NormalizeBroadcastHint extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // Capture the broadcasted information and store it in Hint.
+    case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _)))) =>
--- End diff --

Ah, your solution depends on the rule `AddSubquery`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14028
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14028
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62445/
Test PASSed.





[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14028
  
**[Test build #62445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62445/consoleFull)** for PR 14028 at commit [`6570a98`](https://github.com/apache/spark/commit/6570a9874e60ecb9366ea37a0e5dfe06b821dc62).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098603
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging {
   }
 }
 
+/**
+ * Merge and move upward to the nearest Project.
+ * A broadcast hint comment is scattered into multiple nodes inside the plan, and the
+ * information of BroadcastHint resides at its current position inside the plan. In order to
+ * reconstruct the broadcast hint comment, we need to pack the information of BroadcastHint into
+ * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node.
+ */
+object NormalizeBroadcastHint extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // Capture the broadcasted information and store it in Hint.
+    case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _)))) =>
--- End diff --

Why can you always assume there is a `SubqueryAlias` above `Project`?





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098299
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging {
   }
 }
 
+/**
+ * Merge and move upward to the nearest Project.
+ * A broadcast hint comment is scattered into multiple nodes inside the plan, and the
+ * information of BroadcastHint resides at its current position inside the plan. In order to
+ * reconstruct the broadcast hint comment, we need to pack the information of BroadcastHint into
+ * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node.
+ */
+object NormalizeBroadcastHint extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // Capture the broadcasted information and store it in Hint.
+    case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _)))) =>
+      Hint("BROADCAST", Seq(table), child)
+
+    // Nearest Project is found.
+    case p @ Project(_, Hint(_, _, _)) => p
+
+    // Merge BROADCAST hints up to the nearest Project.
+    case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) =>
+      h.copy(parameters = params1 ++ params2)
+    case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) =>
+      h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right))
+
+    // Bubble up BROADCAST hints to the nearest Project.
--- End diff --

Could you improve this? It is hard to maintain as written.

For example, why is `LocalLimit` included but `GlobalLimit` is not?
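
If `GlobalLimit` were meant to be covered, the case would presumably mirror the quoted `LocalLimit` one; the fragment below is only an illustration of the pattern, not code from the PR:

```scala
// Hypothetical case for NormalizeBroadcastHint, symmetric with the LocalLimit case.
case gl @ GlobalLimit(_, h @ Hint("BROADCAST", _, hintChild)) =>
  h.copy(child = gl.copy(child = hintChild))
```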





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098109
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

Oh, you mean the quoted style. That's a different story. Let me check how Hive handles that.





[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...

2016-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14244
  
**[Test build #62447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62447/consoleFull)** for PR 14244 at commit [`871d358`](https://github.com/apache/spark/commit/871d358b3b24b5b2acf46f5d882f4b8614fdd670).





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098026
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

I intentionally removed that from this PR; you cannot get everything into a single PR. Since you already know the Oracle global hint (you asked to support `view_name.table_name`), you can do better than me on that part.





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71098009
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -87,6 +87,7 @@ class Analyzer(
       EliminateUnions),
     Batch("Resolution", fixedPoint,
       ResolveRelations ::
+      SubstituteHints ::
--- End diff --

Have you tried moving it after the `Resolution` batch? Did you hit any errors?





[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...

2016-07-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/13988#discussion_r71097989
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.csv
+
+import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Converts a sequence of string to CSV string
+ */
+private[csv] object UnivocityGenerator extends Logging {
+  /**
+   * Transforms a single InternalRow to CSV using Univocity
+   *
+   * @param rowSchema the schema object used for conversion
+   * @param writer a CsvWriter object
+   * @param headers headers to write
+   * @param writeHeader true if it needs to write header
+   * @param options CSVOptions object containing options
+   * @param row The row to convert
+   */
+  def apply(
+      rowSchema: StructType,
+      writer: CsvWriter,
+      headers: Array[String],
+      writeHeader: Boolean,
+      options: CSVOptions)(row: InternalRow): Unit = {
+    val tokens = {
+      row.toSeq(rowSchema).map { field =>
+        // TODO: It seems not all the data types are able to be represented by `toString`.
+        // For example, `DateType` and `TimestampType` are represented as long values (timestamps).
--- End diff --

Actually, I opened another PR here: https://github.com/apache/spark/pull/13912. Maybe it is related to this PR.
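
For what it's worth, a hedged sketch of the per-field conversion that TODO points at, using Catalyst's internal representations (`DateType` is stored as days since the epoch in an `Int`, `TimestampType` as microseconds in a `Long`); the helper name `toToken` is made up for illustration:

```scala
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types.{DataType, DateType, TimestampType}

// Convert one Catalyst value to a CSV token, handling the types that a
// plain toString would get wrong.
def toToken(value: Any, dataType: DataType): String = (value, dataType) match {
  case (null, _) => null
  case (days: Int, DateType) => DateTimeUtils.toJavaDate(days).toString
  case (us: Long, TimestampType) => DateTimeUtils.toJavaTimestamp(us).toString
  case (v, _) => v.toString
}
```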





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71097984
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

Yep. You can try that in another PR. :)





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71097965
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

See the PR: https://github.com/apache/spark/pull/14244 

I think we should cover all the cases in `TableIdentifierParserSuite.scala` if you do not want to parse it in the `Analyzer`.
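
For example, a hedged sketch of the kind of case being added, in the suite's existing style (`parseTableIdentifier` is `CatalystSqlParser`'s entry point for table names):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser.parseTableIdentifier

test("quoted identifiers") {
  // Backquoted parts should parse even when they contain spaces or dots.
  assert(parseTableIdentifier("`d b`.`t a b`") === TableIdentifier("t a b", Some("d b")))
  assert(parseTableIdentifier("`x.y`.z") === TableIdentifier("z", Some("x.y")))
}
```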





[GitHub] spark pull request #14244: [SPARK-16598] [SQL] [TEST] Added a test case for ...

2016-07-17 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/14244

[SPARK-16598] [SQL] [TEST] Added a test case for verifying the table identifier parsing

 What changes were proposed in this pull request?
So far, the test cases of `TableIdentifierParserSuite` do not cover the quoted cases. We should add one to avoid regressions.

 How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark quotedIdentifiers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14244


commit 871d358b3b24b5b2acf46f5d882f4b8614fdd670
Author: gatorsmile 
Date:   2016-07-18T03:29:56Z

added a test case







[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71097771
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

Let me try the quoted identifiers in the parser. 





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71097752
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

How about Oracle?





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-17 Thread ooq
Github user ooq commented on the issue:

https://github.com/apache/spark/pull/14174
  
hey @sameeragarwal, the diff of the generated code can be found here (with the left side for the vectorized hashmap and the right side for the new, row-based one):
https://gist.github.com/ooq/d25d4de4c445752c6e1a810262c2e5c1/revisions?diff=split
Below is my code to generate them:
```scala
  test("generated code comparison for vectorized vs. rowbased") {
    val N = 20 << 23

    sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
    sparkSession.conf.set("spark.sql.codegen.aggregate.map.columns.max", "30")

    sparkSession.range(N)
      .selectExpr(
        "id & 1023 as k1",
        "cast (id & 1023 as string) as k2")
      .createOrReplaceTempView("test")

    // dataframe/query
    val query = sparkSession.sql("select count(k1), sum(k1) from test group by k1, k2")

    // vectorized
    sparkSession.conf.set("spark.sql.codegen.aggregate.map.enforce.impl", "vectorized")
    query.queryExecution.debug.codegen()

    // row based
    sparkSession.conf.set("spark.sql.codegen.aggregate.map.enforce.impl", "rowbased")
    query.queryExecution.debug.codegen()
  }
```





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71097403
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -87,6 +87,7 @@ class Analyzer(
       EliminateUnions),
     Batch("Resolution", fixedPoint,
       ResolveRelations ::
+      SubstituteHints ::
--- End diff --

That's a good point. What is the best policy for a `Rule` that should run `Once` inside a fixedPoint batch? To minimize the side effects of `Hint` on other rules, I put it right after `ResolveRelations`.
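
For the record, a hedged sketch of one alternative placement being weighed here: giving the substitution its own `Once` batch ahead of `Resolution` (batch and rule names follow the quoted `Analyzer` code; the elided rules stay unchanged):

```scala
// Inside Analyzer.batches (sketch only).
Batch("Hints", Once,
  SubstituteHints),
Batch("Resolution", fixedPoint,
  ResolveRelations,
  ResolveReferences /* ... the remaining Resolution rules ... */)
```

Running it in its own `Once` batch would guarantee it never interacts with the fixed-point iteration of the resolution rules.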





[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries

2016-07-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14132#discussion_r71097256
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1774,6 +1775,51 @@ class Analyzer(
   }
 
   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
--- End diff --

Currently, the parser does not allow specifying a database name. (I didn't commit that code since I found that Hive doesn't allow it either.) If the parser accepts it, it's valid in every case, because the parser will accept only identifiers.
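
To make the identifier-only matching concrete, here is a hedged sketch of the per-parameter step, close to what the quoted `SubstituteHints` loop does (`UnresolvedRelation`'s shape follows Spark 2.0; `broadcastByName` is a made-up helper name):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.{BroadcastHint, LogicalPlan}

// Wrap every relation whose unqualified name matches the hint parameter,
// comparing case-insensitively.
def broadcastByName(child: LogicalPlan, param: String): LogicalPlan =
  child transformDown {
    case r @ UnresolvedRelation(tableIdent, _) if tableIdent.table.equalsIgnoreCase(param) =>
      BroadcastHint(r)
  }
```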





[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...

2016-07-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14102#discussion_r71097210
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala ---
@@ -35,184 +34,306 @@ import org.apache.spark.util.Utils
 
 private[json] class SparkSQLJsonProcessingException(msg: String) extends RuntimeException(msg)
 
-object JacksonParser extends Logging {
+private[sql] class JacksonParser(schema: StructType, options: JSONOptions) extends Logging {
+  import com.fasterxml.jackson.core.JsonToken._
 
-  def parse(
-      input: RDD[String],
-      schema: StructType,
-      columnNameOfCorruptRecords: String,
-      configOptions: JSONOptions): RDD[InternalRow] = {
+  // A `ValueConverter` is responsible for converting a value from `JsonParser`
+  // to a value in a field for `InternalRow`.
+  private type ValueConverter = (JsonParser) => Any
 
-    input.mapPartitions { iter =>
-      parseJson(iter, schema, columnNameOfCorruptRecords, configOptions)
+  // `ValueConverter`s for the root schema for all fields in the schema
+  private val rootConverter: ValueConverter = makeRootConverter(schema)
+
+  private val factory = new JsonFactory()
+  options.setJacksonOptions(factory)
+
+  private def failedConversion(
+      parser: JsonParser,
+      dataType: DataType): Any = parser.getCurrentToken match {
+    case null | VALUE_NULL =>
+      null
+
+    case _ if parser.getTextLength < 1 =>
+      // guard the non string type
+      null
+
+    case token =>
+      // We cannot parse this token based on the given data type. So, we throw a
+      // SparkSQLJsonProcessingException and this exception will be caught by
+      // parseJson method.
+      throw new SparkSQLJsonProcessingException(
+        s"Failed to parse a value for data type $dataType (current token: $token).")
+  }
+
+  private def failedRecord(record: String): Seq[InternalRow] = {
+    // create a row even if no corrupt record column is present
+    if (options.failFast) {
+      throw new RuntimeException(s"Malformed line in FAILFAST mode: $record")
+    }
+    if (options.dropMalformed) {
+      logWarning(s"Dropping malformed line: $record")
+      Nil
+    } else {
+      val row = new GenericMutableRow(schema.length)
+      for (corruptIndex <- schema.getFieldIndex(options.columnNameOfCorruptRecord)) {
+        require(schema(corruptIndex).dataType == StringType)
+        row.update(corruptIndex, UTF8String.fromString(record))
+      }
+      Seq(row)
     }
   }
 
   /**
-   * Parse the current token (and related children) according to a desired schema
-   * This is a wrapper for the method `convertField()` to handle a row wrapped
-   * with an array.
+   * Create a converter which converts the JSON documents held by the `JsonParser`
+   * to a value according to a desired schema. This is a wrapper for the method
+   * `makeConverter()` to handle a row wrapped with an array.
    */
-  def convertRootField(
-      factory: JsonFactory,
-      parser: JsonParser,
-      schema: DataType): Any = {
-    import com.fasterxml.jackson.core.JsonToken._
-    (parser.getCurrentToken, schema) match {
-      case (START_ARRAY, st: StructType) =>
-        // SPARK-3308: support reading top level JSON arrays and take every element
-        // in such an array as a row
-        convertArray(factory, parser, st)
-
-      case (START_OBJECT, ArrayType(st, _)) =>
-        // the business end of SPARK-3308:
-        // when an object is found but an array is requested just wrap it in a list
-        convertField(factory, parser, st) :: Nil
+  def makeRootConverter(dataType: DataType): ValueConverter = dataType match {
+    case st: StructType =>
+      // SPARK-3308: support reading top level JSON arrays and take every element
+      // in such an array as a row
+      val elementConverter = makeConverter(st)
+      val fieldConverters = st.map(_.dataType).map(makeConverter)
+      (parser: JsonParser) => parser.getCurrentToken match {
+        case START_OBJECT => convertObject(parser, st, fieldConverters)
+        case START_ARRAY => convertArray(parser, elementConverter)
+        case _ => failedConversion(parser, st)
+      }
 
-      case _ =>
-        convertField(factory, parser, schema)
-    }
+    case ArrayType(st: StructType, _) =>
+      // the business end of SPARK-3308:
+      // when an object is found but an array is requested just wrap it in a list
+      val elementConverter = makeConverter(st)
+      val fieldConverters = st.map(_.dataType).map(
