[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14116 Thank you so much, @gatorsmile . And, sorry for the late response. Definitely, I have many things to do. Now, it's my turn. Let's see how many of them I can handle. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14235 Merged build finished. Test FAILed.
[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14235 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62448/ Test FAILed.
[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14235 **[Test build #62450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62450/consoleFull)** for PR 14235 at commit [`a9a1b00`](https://github.com/apache/spark/commit/a9a1b00411fa6175dd1e628a3485a4313999b636).
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14098 Please be careful with case sensitivity. It broke the release candidate last time.
[GitHub] spark issue #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to chec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14235 **[Test build #62448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62448/consoleFull)** for PR 14235 at commit [`38f52ce`](https://github.com/apache/spark/commit/38f52cea6fbd53cf66429d396663708afcb1f193).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14179: [SPARK-16055][SPARKR] warning added while using sparkPac...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14179 LGTM. btw, for future reference, you could use either
```
# sep defaults to " "
paste("sparkPackages has no effect when using spark-submit or sparkR shell,",
      "please use the --packages commandline instead")
```
or
```
paste0("sparkPackages has no effect when using spark-submit or sparkR shell, ",
       "please use the --packages commandline instead")
```
[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable token mana...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/14065#discussion_r71104553

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -390,8 +393,22 @@ private[spark] class Client(
     // Upload Spark and the application JAR to the remote file system if necessary,
     // and add them as local resources to the application master.
     val fs = destDir.getFileSystem(hadoopConf)
-    val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + destDir
-    YarnSparkHadoopUtil.get.obtainTokensForNamenodes(nns, hadoopConf, credentials)
+
+    // Merge credentials obtained from registered providers
+    var nearestTimeOfNextRenewal = Long.MaxValue
+    credentialManager.obtainCredentials(hadoopConf, credentials).foreach { t =>
--- End diff --

Yes, in the current implementation of `HDFSCredentialProvider`, the time of the next token renewal is only calculated when a principal is provided (`loginFromKeytab` is true).
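The renewal-tracking pattern discussed above (each provider may report when its token next needs renewal, and the manager keeps the earliest such time, starting from `Long.MaxValue`) can be sketched in isolation. The provider trait below is a hypothetical stand-in, not Spark's actual YARN credential API:

```scala
// Hypothetical stand-in for a credential provider: obtaining credentials
// may yield the timestamp (in ms) at which the token next needs renewal.
trait CredentialProvider {
  def obtainCredentials(): Option[Long]
}

object RenewalTracker {
  // Fold over all providers, keeping the earliest renewal time seen.
  // If no provider reports a renewal time, Long.MaxValue remains,
  // i.e. "no renewal needed".
  def nearestRenewal(providers: Seq[CredentialProvider]): Long =
    providers.foldLeft(Long.MaxValue) { (nearest, p) =>
      p.obtainCredentials() match {
        case Some(t) => math.min(nearest, t)
        case None    => nearest // provider has no renewable token
      }
    }

  def main(args: Array[String]): Unit = {
    val providers = Seq(
      new CredentialProvider { def obtainCredentials() = Some(5000L) },
      new CredentialProvider { def obtainCredentials() = None },
      new CredentialProvider { def obtainCredentials() = Some(3000L) }
    )
    println(RenewalTracker.nearestRenewal(providers)) // prints 3000
  }
}
```

This also illustrates the reviewer's point: a provider that never computes a renewal time (e.g. no principal/keytab configured) simply contributes `None` and leaves the nearest-renewal estimate untouched.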
[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14207 Does this mean that if users do not issue a refresh when the table location changes, the schema will be wrong when Spark restarts?
[GitHub] spark issue #14243: [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include j...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14243 Looks good, interesting approach - this won't run in CRAN since this is part of the Scala tests?
[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14244 cc @hvanhovell Is this test included in the other suite?
[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14245 Merged build finished. Test PASSed.
[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14245 **[Test build #62449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62449/consoleFull)** for PR 14245 at commit [`e797c30`](https://github.com/apache/spark/commit/e797c30b94fb191b844ab7ef247a84baf5260379).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14245 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62449/ Test PASSed.
[GitHub] spark pull request #14243: [SPARK-10683][SPARK-16510][SPARKR] Move SparkR in...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14243#discussion_r71104052

--- Diff: R/pkg/inst/tests/testthat/jarTest.R ---
@@ -16,17 +16,17 @@
 #
 library(SparkR)
-sparkR.session()
+sc <- sparkR.session()
--- End diff --

Probably left behind from a merge - `sc` was not referenced and was removed.
[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14207 @viirya The problem it tries to resolve is from the comment of @rxin in another PR: https://github.com/apache/spark/pull/14148#issuecomment-232273833
[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14116 Finished my first pass. My major concern is that the handling of `INFORMATION_SCHEMA` does not look clean to me; it looks hacky and causes many holes. More test cases are needed.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14098 @wangmiao1981 I guess it's not ready yet. You may put a `[WIP]` tag in the PR title when it's in WIP status and remove it when it is ready for review.
[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14098#discussion_r71103329

--- Diff: docs/sql-programming-guide.md ---
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/SparkSQLExample.py %}
--- End diff --

And the file path is wrong. The actual path is `python/sql/SparkSqlExample.py`.
[GitHub] spark issue #14150: [SPARK-16494] [ML] Upgrade breeze version to 0.12
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14150 @srowen I found no obvious compatibility issues after reading the release notes. If this looks good, please let it get in, since [SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181) depends upon this.
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71103250

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/systemcatalog/InformationSchema.scala ---
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.systemcatalog
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.SessionCatalog.{DEFAULT_DATABASE, INFORMATION_SCHEMA_DATABASE}
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.CatalogImpl
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+/**
+ * INFORMATION_SCHEMA is a database consisting of views which provide information about all of the
+ * tables, views, and columns in a database.
+ *
+ * These views are designed to be populated by this package in order to be independent from the
+ * Spark catalog. To keep dependencies minimal, INFORMATION_SCHEMA views are currently implemented
+ * as Spark temporary views with a database prefix: `SessionCatalog.INFORMATION_SCHEMA_DATABASE`.
+ *
+ * The following is the class hierarchy in this package rooted at InformationSchemaRelationProvider.
+ *
+ * InformationSchemaRelationProvider
+ *   -> DatabasesRelationProvider
+ *   -> TablesRelationProvider
+ *   -> ViewsRelationProvider
+ *   -> ColumnsRelationProvider
+ *   -> SessionVariablesRelationProvider
+ */
+
+/**
+ * The InformationSchema object provides bootstrap and utility functions.
+ */
+object InformationSchema {
+
+  /**
+   * Register the INFORMATION_SCHEMA database. SessionCatalog.catalog invokes this function.
+   */
+  def registerInformationSchema(sparkSession: SparkSession): Unit = {
+    sparkSession.sql(s"CREATE DATABASE IF NOT EXISTS $INFORMATION_SCHEMA_DATABASE")
+    registerView(sparkSession, new DatabasesRelationProvider, Seq("schemata", "databases"))
+    registerView(sparkSession, new TablesRelationProvider, Seq("tables"))
+    registerView(sparkSession, new ViewsRelationProvider, Seq("views"))
+    registerView(sparkSession, new ColumnsRelationProvider, Seq("columns"))
+    registerView(sparkSession, new SessionVariablesRelationProvider, Seq("session_variables"))
+  }
+
+  /**
+   * Register an INFORMATION_SCHEMA relation provider as a temporary view in the Spark catalog.
+   */
+  private def registerView(
+      sparkSession: SparkSession,
+      relationProvider: SchemaRelationProvider,
+      names: Seq[String]) {
+    val plan =
+      LogicalRelation(relationProvider.createRelation(sparkSession.sqlContext, null, null)).analyze
+    val projectList = plan.output.zip(plan.schema).map {
+      case (attr, col) => Alias(attr, col.name)()
+    }
+    sparkSession.sessionState.executePlan(Project(projectList, plan))
+
+    for (name <- names) {
+      sparkSession.sessionState.catalog.createTempView(s"$INFORMATION_SCHEMA_DATABASE.$name",
+        plan, overrideIfExists = true)
+    }
+  }
+
+  /**
+   * Only EqualTo filters are handled in INFORMATION_SCHEMA data sources.
+   */
+  def unhandledFilters(filters: Array[Filter], columnName: String): Array[Filter] = {
+    import org.apache.spark.sql.sources.EqualTo
+    filters.filter {
+      case EqualTo(attribute, _) if attribute.equalsIgnoreCase(columnName) => false
+      case _ => true
+    }
+  }
+
+  /**
+   * Return an `EqualTo`-filtered DataFrame.
+   */
+  def getFilteredTables(sparkSession: SparkSession, filters: Seq[Expression], columnName: String)
+    : DataFrame = {
+    import org.apache.sp
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14098 @wangmiao1981 Is this ready for review now? Also, please update the PR title to:
```
[SPARK-16380][SQL][EXAMPLE] Update SQL examples and programming guide for Python language binding
```
[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14098#discussion_r71103044

--- Diff: docs/sql-programming-guide.md ---
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/SparkSQLExample.py %}
--- End diff --

The original file name `sql.py` should be fine. Actually, the new name `SparkSQLExample.py` doesn't conform to the Python naming convention. (You may check the other file names under `examples/src/main/python`.)
[GitHub] spark pull request #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71103029

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

While making that, I found this becomes much more convenient. Thank you.
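The normalization step quoted in the diff above can be tried in isolation: stripping the numeric id from generated attribute names makes the SQL stable across runs, so it can be compared against checked-in answer files. A minimal sketch (the object name is ours, the regex is the one from the diff):

```scala
object GenAttrNormalizer {
  // Replace id-suffixed generated attribute names (`gen_attr_12`) with a
  // stable placeholder (`gen_attr`) so answer files don't depend on
  // nondeterministic attribute ids.
  def normalize(sql: String): String =
    sql.replaceAll("`gen_attr_\\d+`", "`gen_attr`")

  def main(args: Array[String]): Unit = {
    val generated = "SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_42` FROM t) AS sub"
    println(normalize(generated))
    // SELECT `gen_attr` AS `id` FROM (SELECT `gen_attr` FROM t) AS sub
  }
}
```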
[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14207 I think it is not clear what problem this PR tries to solve. It just says it proposes to save the inferred schema in the external catalog.
[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14245 LGTM. Can we reuse an existing JIRA number?
[GitHub] spark issue #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example update
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14245 **[Test build #62449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62449/consoleFull)** for PR 14245 at commit [`e797c30`](https://github.com/apache/spark/commit/e797c30b94fb191b844ab7ef247a84baf5260379).
[GitHub] spark pull request #14169: [SPARK-16515][SQL]set default record reader and w...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14169#discussion_r71102534

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -1340,10 +1340,17 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
       val name = conf.getConfString("hive.script.serde",
         "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
       val props = Seq("field.delim" -> "\t")
-      val recordHandler = Try(conf.getConfString(configKey)).toOption
+      val recordHandler = defaultRecordHandler(configKey)
       (Nil, Option(name), props, recordHandler)
     }
+
+    def defaultRecordHandler(configKey: String): Option[String] = {
+      Try(conf.getConfString(configKey)).orElse(Try(configKey match {
+        case "hive.script.recordreader" => "org.apache.hadoop.hive.ql.exec.TextRecordReader"
+        case "hive.script.recordwriter" => "org.apache.hadoop.hive.ql.exec.TextRecordWriter"
+      })).toOption
+    }
+
     val (inFormat, inSerdeClass, inSerdeProps, reader) =
       format(inRowFormat, "hive.script.recordreader")
--- End diff --

Can we pass in the default value for the reader/writer? Like `format(inRowFormat, "hive.script.recordreader", "org.apache.hadoop.hive.ql.exec.TextRecordReader")` and `format(outRowFormat, "hive.script.recordwriter", "org.apache.hadoop.hive.ql.exec.TextRecordWriter")`. Then, in `def format`, we just use `getConfString(key: String, defaultValue: String)`.
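The reviewer's suggestion amounts to replacing the `Try`/`orElse` pattern in the patch with a lookup that takes an explicit default. The two styles can be contrasted in a self-contained sketch; the `Map`-backed conf below is a stand-in for Spark's `SQLConf`, not its real implementation:

```scala
import scala.util.Try

object ConfDefaults {
  // Stand-in for SQLConf: like getConfString(key), this throws when unset.
  private val conf =
    Map("hive.script.serde" -> "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")

  def getConfString(key: String): String =
    conf.getOrElse(key, throw new NoSuchElementException(key))

  // Overload mirroring getConfString(key, defaultValue): no exception handling needed.
  def getConfString(key: String, defaultValue: String): String =
    conf.getOrElse(key, defaultValue)

  def main(args: Array[String]): Unit = {
    // Style in the patch: recover from the missing key with Try.
    val viaTry = Try(getConfString("hive.script.recordreader"))
      .toOption
      .getOrElse("org.apache.hadoop.hive.ql.exec.TextRecordReader")

    // Style the reviewer suggests: pass the default down.
    val viaDefault = getConfString(
      "hive.script.recordreader", "org.apache.hadoop.hive.ql.exec.TextRecordReader")

    println(viaTry == viaDefault) // prints true
  }
}
```

Both yield the same value; the second reads as one call and avoids wrapping control flow in `Try`, which is the point of the review comment.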
[GitHub] spark pull request #14245: [MINOR][DOCS][EXAMPLES] Minor Scala example updat...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/14245 [MINOR][DOCS][EXAMPLES] Minor Scala example update

## What changes were proposed in this pull request?

This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`.

## How was this patch tested?

Manually verified the generated HTML page.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark minor-scala-example-update

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14245.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14245

commit e797c30b94fb191b844ab7ef247a84baf5260379
Author: Cheng Lian
Date: 2016-07-18T06:00:33Z

    Minor Scala example update
[GitHub] spark pull request #14235: [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71102449

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }
-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that both arguments are normalized for convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
--- End diff --

Yep.
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71102423

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/systemcatalog/InformationSchema.scala ---
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.systemcatalog
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog.SessionCatalog.{DEFAULT_DATABASE, INFORMATION_SCHEMA_DATABASE}
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.CatalogImpl
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+/**
+ * INFORMATION_SCHEMA is a database consisting views which provide information about all of the
+ * tables, views, columns in a database.
+ *
+ * These views are designed to be populated by this package in order to be independent from
+ * Spark catalog. To keep minimal dependency, currently INFORMATION_SCHEMA views are implemented as
+ * Spark temporary views with a database prefix: `SessionCatalog.INFORMATION_SCHEMA_DATABASE`.
+ *
+ * The following is the class hierarchy in this package rooted at InformationSchemaRelationProvider.
+ *
+ * InformationSchemaRelationProvider
+ *   -> DatabasesRelationProvider
+ *   -> TablesRelationProvider
+ *   -> ViewsRelationProvider
+ *   -> ColumnsRelationProvider
+ *   -> SessionVariablesRelationProvider
+ */
+
+/**
+ * InformationSchema object provides bootstrap and utility functions.
+ */
+object InformationSchema {
+
+  /**
+   * Register INFORMATION_SCHEMA database. SessionCatalog.catalog invokes this function.
+   */
+  def registerInformationSchema(sparkSession: SparkSession): Unit = {
+    sparkSession.sql(s"CREATE DATABASE IF NOT EXISTS $INFORMATION_SCHEMA_DATABASE")
+    registerView(sparkSession, new DatabasesRelationProvider, Seq("schemata", "databases"))
+    registerView(sparkSession, new TablesRelationProvider, Seq("tables"))
+    registerView(sparkSession, new ViewsRelationProvider, Seq("views"))
+    registerView(sparkSession, new ColumnsRelationProvider, Seq("columns"))
+    registerView(sparkSession, new SessionVariablesRelationProvider, Seq("session_variables"))
+  }
+
+  /**
+   * Register an INFORMATION_SCHEMA relation provider as a temporary view of Spark Catalog.
+   */
+  private def registerView(
+      sparkSession: SparkSession,
+      relationProvider: SchemaRelationProvider,
+      names: Seq[String]) {
+    val plan =
+      LogicalRelation(relationProvider.createRelation(sparkSession.sqlContext, null, null)).analyze
--- End diff --

`null, null` -> `parameters = null`, `schema = null`
[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable token mana...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/14065#discussion_r71102254

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/security/ConfigurableCredentialManager.scala ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.security
+
+import java.util.ServiceLoader
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.Credentials
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * A ConfigurableCredentialManager to manage all the registered credential providers and offer
+ * APIs for other modules to obtain credentials as well as renewal time. By default
+ * [[HDFSCredentialProvider]], [[HiveCredentialProvider]] and [[HBaseCredentialProvider]] will
+ * be loaded in, any plugged-in credential provider wants to be managed by
+ * ConfigurableCredentialManager needs to implement [[ServiceCredentialProvider]] interface and put
+ * into resources to be loaded by ServiceLoader.
+ *
+ * Also the specific credential provider is controlled by
+ * spark.yarn.security.credentials.{service}.enabled, it will not be loaded in if set to false.
+ */
+final class ConfigurableCredentialManager private[yarn] (sparkConf: SparkConf) extends Logging {
+  private val deprecatedProviderEnabledConfig = "spark.yarn.security.tokens.%s.enabled"
+  private val providerEnabledConfig = "spark.yarn.security.credentials.%s.enabled"
+
+  // Maintain all the registered credential providers
+  private val credentialProviders = mutable.HashMap[String, ServiceCredentialProvider]()
+
+  // Default crendetial providers that will be loaded automatically, unless specifically disabled.
+  private val defaultCredentialProviders = Map(
--- End diff --

Sorry, my fault. It is not necessary now; I will remove it.
[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14235

Can you also remove "[TEST]" from the title? TEST isn't a module.
[GitHub] spark pull request #14065: [SPARK-14743][YARN] Add a configurable token mana...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/14065#discussion_r71102195

--- Diff: dev/.rat-excludes ---
@@ -99,3 +99,4 @@ spark-deps-.*
 .*tsv
 org.apache.spark.scheduler.ExternalClusterManager
 .*\.sql
+org.apache.spark.deploy.yarn.security.ServiceCredentialProvider
--- End diff --

I followed the convention of other services like `DataSource`, which also need to be excluded from the rat check.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71102174

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ (same hunk as quoted earlier in this digest)
--- End diff --

Also, instead of using a variable as a flag, maybe we can use an environment variable, and then document it at the beginning:

```
 * To re-generate golden files, run:
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
```

something like this?
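The environment-variable gate suggested above can be sketched roughly as follows. This is a minimal, hypothetical sketch, not the actual suite code; names such as `GoldenFiles`, `regenerateGoldenFiles`, and `checkAnswer` are illustrative only.

```scala
// Hypothetical sketch: gate golden-file regeneration on an environment
// variable instead of a hard-coded boolean flag in the source.
object GoldenFiles {
  // True only when SPARK_GENERATE_GOLDEN_FILES=1 is set in the environment,
  // e.g. SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *SomeSuite"
  val regenerateGoldenFiles: Boolean =
    sys.env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  def checkAnswer(generatedSQL: String, answerFile: java.nio.file.Path): Unit = {
    if (regenerateGoldenFiles) {
      // Regeneration mode: overwrite the golden file with the fresh output.
      java.nio.file.Files.write(answerFile, generatedSQL.getBytes("UTF-8"))
    } else {
      // Normal mode: compare against the stored golden file.
      val expected =
        new String(java.nio.file.Files.readAllBytes(answerFile), "UTF-8")
      assert(expected == generatedSQL)
    }
  }
}
```

The advantage over a `saveQuery` constant is that no source edit (and hence no risk of accidentally committing `saveQuery = true`) is needed to regenerate answers.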
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71102142

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/systemcatalog/InformationSchema.scala ---
@@ -0,0 +1,312 @@ (same file as quoted earlier in this digest; the comment targets this line)
+    sparkSession.sql(s"CREATE DATABASE IF NOT EXISTS $INFORMATION_SCHEMA_DATABASE")
--- End diff --

I am lost. A database `INFORMATION_SCHEMA_DATABASE` is created through a SQL command, but the function `databaseExists` always returns true for the database `INFORMATION_SCHEMA_DATABASE`.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71102107

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ (same hunk as quoted earlier in this digest)
--- End diff --

Ah, I see. My worry is that Maven and SBT have different base paths for running tests, so the test will fail in Maven. How about we define the output directory as a variable at the beginning of the suite, and then use that when we write output? For reading input we simply use "sqlgen/$answerFile.sql".
[GitHub] spark pull request #14236: [SPARK-16588][SQL] Missed API fix for a function ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14236
[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14235

Now, the remaining issue is using `getResource` to save the golden files. I left a comment about that.
[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14235

**[Test build #62448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62448/consoleFull)** for PR 14235 at commit [`38f52ce`](https://github.com/apache/spark/commit/38f52cea6fbd53cf66429d396663708afcb1f193).
[GitHub] spark issue #14236: [SPARK-16588][SQL] Missed API fix for a function name mi...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14236

@rxin Sure, thank you very much.
[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14116

```Scala
val catalog = spark.sessionState.catalog
catalog.setCurrentDatabase(SessionCatalog.INFORMATION_SCHEMA_DATABASE)
sql("CREATE TABLE my_tab (age INT, name STRING)")
```

We can set the current database to `INFORMATION_SCHEMA_DATABASE`, but we get an error when we try to create a table in this database:

```
Database 'information_schema' not found;
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'information_schema' not found;
```
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101786

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ (same hunk as quoted earlier in this digest)
--- End diff --

For this one, here is the code and the result in IntelliJ.

```
val path = s"src/test/resources/sqlgen/$answerFile.sql"
val path2 = getClass.getClassLoader.getResource("sqlgen/README.md")
```

```
file:/Users/dongjoon/SPARK-CHECK-GEN-SQL/sql/hive/target/scala-2.11/test-classes/sqlgen/README.md
```

Reading resource files is not a problem, but we need the src directory for saving new golden SQL.
[GitHub] spark issue #14236: [SPARK-16588][SQL] Missed API fix for a function name mi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14236

Since it is a small change I will just push a commit myself. Thanks for noticing it.
[GitHub] spark pull request #14236: [SPARK-16588][SQL] Missed API fix for a function ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14236#discussion_r71101740

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -977,7 +977,10 @@ object functions {
   *
   * @group normal_funcs
   * @since 1.4.0
+  * @deprecated As of 2.0.0, replaced by `monotonically_increasing_id`. This will be removed in
+  *             Spark 2.1.
   */
+  @deprecated("Use monotonically_increasing_id. This will be removed in Spark 2.1.", "2.0.0")
--- End diff --

We probably don't want to remove it in 2.1. Can you just say "Use monotonically_increasing_id()"?
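The suggested wording can be illustrated with a small stub. This is not the real `functions.scala`; the method bodies are placeholders, and only the shape of the `@deprecated` annotation (pointing at the replacement without promising a removal release) is the point.

```scala
// Illustrative stub of the deprecation wording under discussion. The bodies
// are placeholders, not Spark's actual implementations.
object FunctionsSketch {
  /** Placeholder standing in for the replacement function. */
  def monotonically_increasing_id(): String = "monotonically_increasing_id()"

  // Deprecation message names the replacement but makes no removal promise.
  @deprecated("Use monotonically_increasing_id()", "2.0.0")
  def monotonicallyIncreasingId(): String = monotonically_increasing_id()
}
```

Callers of the old name still compile, with a deprecation warning steering them to the new one.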
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101535

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ (same hunk as quoted earlier in this digest)
--- End diff --

I'll try again.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101452

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ (same hunk as quoted earlier in this digest)
--- End diff --

What I mean is that `sqlgen` is currently `sql/hive/src/test/resources/sqlgen`. So `sql/hive/src/test/resources/sqlgen/README.md` is also copied, like the other resources, into `test-classes`. Did I perhaps miss something?
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71101399

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -484,8 +521,11 @@ class SessionCatalog(
     val dbTables = externalCatalog.listTables(dbName, pattern).map { t => TableIdentifier(t, Some(dbName)) }
     synchronized {
-      val _tempTables = StringUtils.filterPattern(tempTables.keys.toSeq, pattern)
         .map { t => TableIdentifier(t) }
+      var _tempTables = StringUtils.filterPattern(tempTables.keys.toSeq, pattern)
         .map { t => TableIdentifier(t) }
+      if (db != INFORMATION_SCHEMA_DATABASE) {
--- End diff --

This part is not clear to me. What happens if users want to list tables in database `INFORMATION_SCHEMA_DATABASE`? Will it return empty?
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101198

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
         |# Converted SQL query string:
         |$convertedSQL
         |
-        |# Original HiveQL query string:
-        |$hiveQl
+        |# Original SQL query string:
+        |$sqlString
         |
         |# Resolved query plan:
         |${df.queryExecution.analyzed.treeString}
       """.stripMargin, cause)
     }
   }

+  test("checkSQL") {
--- End diff --

Oh, I see. I'll fix it like that.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101205

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +85,51 @@ (same hunk as quoted earlier in this digest)
--- End diff --

What do you mean? What I meant is to create a README.md file in the sqlgen directory; at runtime we can then get the location of the sqlgen directory via getResource("sqlgen/README.md").
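The marker-file trick can be sketched like this. `ResourceLocator` is a hypothetical helper, not code from the PR, and as the thread's IntelliJ output shows, the resolved URL points into the build output (`target/.../test-classes`) rather than the `src` tree; the sketch assumes a file-based (non-jar) classpath entry, as in test runs.

```scala
import java.io.File

// Hypothetical helper: locate the directory containing a known marker
// resource (e.g. "sqlgen/README.md") on the classpath. Returns None when
// the marker is not found. Assumes the resource lives in a plain directory,
// not inside a jar.
object ResourceLocator {
  def dirOf(markerResource: String): Option[File] =
    Option(getClass.getClassLoader.getResource(markerResource)).map { url =>
      new File(url.toURI).getParentFile
    }
}
```

Because this resolves to the copied resources under the build output, a suite that writes new golden files still needs a separately configured `src/test/resources` path, which is the unresolved point in the discussion.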
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101173

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala ---
@@ -76,22 +88,58 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     }
   }

-  private def checkHiveQl(hiveQl: String): Unit = {
-    val df = sql(hiveQl)
+  // Used for generating new query answer files by saving
+  private val saveQuery = false
+
+  /**
+   * Compare the generated SQL with the expected answer string.
+   * Note that there exists a normalization for both arguments for the convenience.
+   * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`.
+   */
+  private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = {
+    val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`")
+    if (answerFile != null) {
+      val path = s"src/test/resources/sqlgen/$answerFile.sql"
+      val separator = "-" * 80
+      if (saveQuery) {
+        val header = "-- This file is automatically generated by LogicalPlanToSQLSuite."
+        val answerText = s"$header\n${originalSQL.trim()}\n${separator}\n$normalizedGenSQL\n"
+        Files.write(Paths.get(path), answerText.getBytes(StandardCharsets.UTF_8))
+      } else {
+        val answerText = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
+        val sqls = answerText.split(separator)
+        if (sqls.length == 2) {
--- End diff --

No problem.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71101148 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { } } - private def checkHiveQl(hiveQl: String): Unit = { -val df = sql(hiveQl) + // Used for generating new query answer files by saving + private val saveQuery = false + + /** + * Compare the generated SQL with the expected answer string. + * Note that there exists a normalization for both arguments for the convenience. + * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`. + */ + private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = { +val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`") +if (answerFile != null) { + val path = s"src/test/resources/sqlgen/$answerFile.sql" --- End diff -- Ur, I think any resource files under `resources` are copied into `test-classes` with classes.
[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14244 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62447/ Test PASSed.
[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14244 Merged build finished. Test PASSed.
[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14244 **[Test build #62447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62447/consoleFull)** for PR 14244 at commit [`871d358`](https://github.com/apache/spark/commit/871d358b3b24b5b2acf46f5d882f4b8614fdd670). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71101023 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -471,6 +480,34 @@ class SessionCatalog( } /** + * Normalize TableIdentifier by consistently ensuring the following two rules. + * 1. System-side temporary tables should have None as database. + * 2. System-side temporary tables should have prefixed table names. + * Currently, only INFORMATION_SCHEMA tables are system-side temporary tables, and this function + * returns TableIdentifier("information_schema.databases", None). + */ + protected def normalizeTableIdentifier(name: TableIdentifier): TableIdentifier = synchronized { +if (name.database.isDefined) { + if (name.database.contains(INFORMATION_SCHEMA_DATABASE)) { +TableIdentifier(s"$INFORMATION_SCHEMA_DATABASE.${name.table}", None) + } else { +name + } +} else { + val tableName = formatTableName(name.table) + if (tableName.startsWith(INFORMATION_SCHEMA_DATABASE + ".")) { --- End diff -- Can you explain which cases will enter this processing? Is it possible that we could hit a backtick-quoted `INFORMATION_SCHEMA_DATABASE` here?
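The two normalization rules under review can be sketched as a small standalone program. This is a hypothetical simplification: the real method lives in `SessionCatalog`, is `synchronized`, and also handles the database-less branch via `formatTableName` (omitted here); the case handling is an assumption for illustration.

```scala
// Hypothetical stand-in for org.apache.spark.sql.catalyst.TableIdentifier.
case class TableIdentifier(table: String, database: Option[String])

object NormalizeSketch {
  val INFORMATION_SCHEMA_DATABASE = "information_schema"

  // Rule 1: system-wide temporary tables get None as the database.
  // Rule 2: their table names get the "information_schema." prefix.
  def normalize(name: TableIdentifier): TableIdentifier = name.database match {
    case Some(db) if db.equalsIgnoreCase(INFORMATION_SCHEMA_DATABASE) =>
      TableIdentifier(s"$INFORMATION_SCHEMA_DATABASE.${name.table}", None)
    case _ => name
  }

  def main(args: Array[String]): Unit = {
    // INFORMATION_SCHEMA tables are folded into a prefixed, database-less name.
    assert(normalize(TableIdentifier("databases", Some("INFORMATION_SCHEMA"))) ==
      TableIdentifier("information_schema.databases", None))
    // Ordinary tables pass through unchanged.
    assert(normalize(TableIdentifier("t", Some("default"))) ==
      TableIdentifier("t", Some("default")))
    println("ok")
  }
}
```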
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71100937 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -471,6 +480,34 @@ class SessionCatalog( } /** + * Normalize TableIdentifier by consistently ensuring the following two rules. + * 1. System-side temporary tables should have None as database. + * 2. System-side temporary tables should have prefixed table names. + * Currently, only INFORMATION_SCHEMA tables are system-side temporary tables, and this function + * returns TableIdentifier("information_schema.databases", None). --- End diff -- Please correct this description.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71100950 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { |# Converted SQL query string: |$convertedSQL | - |# Original HiveQL query string: - |$hiveQl + |# Original SQL query string: + |$sqlString | |# Resolved query plan: |${df.queryExecution.analyzed.treeString} """.stripMargin, cause) } } + test("checkSQL") { --- End diff -- Actually maybe these should be 4 test cases instead.
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71100943 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { |# Converted SQL query string: |$convertedSQL | - |# Original HiveQL query string: - |$hiveQl + |# Original SQL query string: + |$sqlString | |# Resolved query plan: |${df.queryExecution.analyzed.treeString} """.stripMargin, cause) } } + test("checkSQL") { --- End diff -- What I meant about the inline comment was something like this: ``` test("test behavior for the golden file test harness") { // 1. Test should fail if the SQL query cannot be parsed. val m = intercept[ParseException] { ... } assert(...) // 2. Test should fail if the golden file cannot be found ... } ```
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71100930 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { } } - private def checkHiveQl(hiveQl: String): Unit = { -val df = sql(hiveQl) + // Used for generating new query answer files by saving + private val saveQuery = false + + /** + * Compare the generated SQL with the expected answer string. + * Note that there exists a normalization for both arguments for the convenience. + * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`. + */ + private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = { +val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`") --- End diff -- Sure, I think so. It would be a great issue.
[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14116 General question 1: Should we disallow users from running most DDLs against `INFORMATION_SCHEMA`? For example, create functions, create tables, ...
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71100872 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -101,95 +149,125 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { |# Converted SQL query string: |$convertedSQL | - |# Original HiveQL query string: - |$hiveQl + |# Original SQL query string: + |$sqlString | |# Resolved query plan: |${df.queryExecution.analyzed.treeString} """.stripMargin, cause) } } + test("checkSQL") { --- End diff -- Add some inline comments explaining what we are actually testing here. Also, maybe rename the test to "test behavior for the golden file test harness".
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71100858 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -76,22 +88,58 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { } } - private def checkHiveQl(hiveQl: String): Unit = { -val df = sql(hiveQl) + // Used for generating new query answer files by saving + private val saveQuery = false + + /** + * Compare the generated SQL with the expected answer string. + * Note that there exists a normalization for both arguments for the convenience. + * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`. + */ + private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = { +val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`") +if (answerFile != null) { + val path = s"src/test/resources/sqlgen/$answerFile.sql" + val separator = "-" * 80 + if (saveQuery) { +val header = "-- This file is automatically generated by LogicalPlanToSQLSuite." +val answerText = s"$header\n${originalSQL.trim()}\n${separator}\n$normalizedGenSQL\n" +Files.write(Paths.get(path), answerText.getBytes(StandardCharsets.UTF_8)) + } else { +val answerText = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8) +val sqls = answerText.split(separator) +if (sqls.length == 2) { --- End diff -- shouldn't we do an assert here instead?
[GitHub] spark pull request #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQL...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14235#discussion_r71100846 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -76,22 +85,51 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { } } - private def checkHiveQl(hiveQl: String): Unit = { -val df = sql(hiveQl) + // Used for generating new query answer files by saving + private val saveQuery = false + + /** + * Compare the generated SQL with the expected answer string. + * Note that there exists a normalization for both arguments for the convenience. + * - Remove the id from the generated attributes, e.g., `gen_attr_1` -> `gen_attr`. + */ + private def checkSQLStructure(originalSQL: String, convertedSQL: String, answerFile: String) = { +val normalizedGenSQL = convertedSQL.replaceAll("`gen_attr_\\d+`", "`gen_attr`") --- End diff -- Can we make gen_attr_YYY stable? (can submit a separate pr)
[GitHub] spark pull request #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHE...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14116#discussion_r71100512 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -471,6 +480,34 @@ class SessionCatalog( } /** + * Normalize TableIdentifier by consistently ensuring the following two rules. + * 1. System-side temporary tables should have None as database. --- End diff -- `System-wide`?
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user deanchen commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r71100480 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings} + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.StructType + +/** + * Converts a sequence of string to CSV string + */ +private[csv] object UnivocityGenerator extends Logging { + /** + * Transforms a single InternalRow to CSV using Univocity + * + * @param rowSchema the schema object used for conversion + * @param writer a CsvWriter object + * @param headers headers to write + * @param writeHeader true if it needs to write header + * @param options CSVOptions object containing options + * @param row The row to convert + */ + def apply( + rowSchema: StructType, + writer: CsvWriter, + headers: Array[String], + writeHeader: Boolean, + options: CSVOptions)(row: InternalRow): Unit = { +val tokens = { + row.toSeq(rowSchema).map { field => +// TODO: It seems all the data types are not able to be represented by `toString`. +// For example, `DateType` and `TimestampType` are being long values as timestamps. --- End diff -- ah thanks, commented on that PR. Glad to see someone showing some love to Spark's csv datasource!
[GitHub] spark issue #13912: [SPARK-16216][SQL] CSV data source supports custom date ...
Github user deanchen commented on the issue: https://github.com/apache/spark/pull/13912 @srowen @rxin Would love to see this get merged, as this has been a pain point for us. I'm not a fan of timezoneless dates as an engineer, but passing through or writing timezoneless dates to CSVs has been a necessary task for a variety of reasons in the past. We use Spark to parse large amounts of daily financial asset data with pre-agreed-upon date conventions. Most of the dates we deal with don't belong to a timezone and are treated more like indices with a maximum granularity of a day. The CSV reports we produce for our customers therefore never contain timezones, and almost all dates pass through from the original timezoneless dates. This is not just a common pattern with financial data, but also common practice in large companies I've worked at in the past. One large tech company I worked at used the convention of MST dates and datetimes for all internal systems (database included).
[GitHub] spark issue #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14132 I finished my third pass. Will review it again when your new version is ready. Thanks!
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71100213 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -755,4 +755,243 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkHiveQl("select * from orc_t") } } + + test("broadcast hint on single table") { +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM (SELECT `id` AS `gen_attr` +|FROM `default`.`parquet_t0`) +|AS gen_subquery_0) +| AS parquet_t0 +|""".stripMargin) + +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0, parquet_t0) */ * FROM parquet_t0", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM (SELECT `id` AS `gen_attr` +|FROM `default`.`parquet_t0`) +|AS gen_subquery_0) +| AS parquet_t0 +|""".stripMargin) + +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 as a", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM ((SELECT `id` AS `gen_attr` +| FROM `default`.`parquet_t0`) +| AS gen_subquery_0) +|AS a) +| AS a + """.stripMargin) + +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t0) */ * --- End diff -- This should be `T. parquet_t0`, right?
[GitHub] spark issue #14102: [SPARK-16434][SQL] Avoid per-record type dispatch in JSO...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14102 @yhuai Thank you for your review! I will try to address all your comments first.
[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14102#discussion_r71099616 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala --- @@ -35,184 +34,306 @@ import org.apache.spark.util.Utils private[json] class SparkSQLJsonProcessingException(msg: String) extends RuntimeException(msg) -object JacksonParser extends Logging { +private[sql] class JacksonParser(schema: StructType, options: JSONOptions) extends Logging { + import com.fasterxml.jackson.core.JsonToken._ - def parse( - input: RDD[String], - schema: StructType, - columnNameOfCorruptRecords: String, - configOptions: JSONOptions): RDD[InternalRow] = { + // A `ValueConverter` is responsible for converting a value from `JsonParser` + // to a value in a field for `InternalRow`. + private type ValueConverter = (JsonParser) => Any -input.mapPartitions { iter => - parseJson(iter, schema, columnNameOfCorruptRecords, configOptions) + // `ValueConverter`s for the root schema for all fields in the schema + private val rootConverter: ValueConverter = makeRootConverter(schema) + + private val factory = new JsonFactory() + options.setJacksonOptions(factory) + + private def failedConversion( + parser: JsonParser, + dataType: DataType): Any = parser.getCurrentToken match { +case null | VALUE_NULL => + null + +case _ if parser.getTextLength < 1 => + // guard the non string type + null + +case token => +// We cannot parse this token based on the given data type. So, we throw a +// SparkSQLJsonProcessingException and this exception will be caught by +// parseJson method. 
+throw new SparkSQLJsonProcessingException( + s"Failed to parse a value for data type $dataType (current token: $token).") + } + + private def failedRecord(record: String): Seq[InternalRow] = { +// create a row even if no corrupt record column is present +if (options.failFast) { + throw new RuntimeException(s"Malformed line in FAILFAST mode: $record") +} +if (options.dropMalformed) { + logWarning(s"Dropping malformed line: $record") + Nil +} else { + val row = new GenericMutableRow(schema.length) + for (corruptIndex <- schema.getFieldIndex(options.columnNameOfCorruptRecord)) { +require(schema(corruptIndex).dataType == StringType) +row.update(corruptIndex, UTF8String.fromString(record)) + } + Seq(row) } } /** - * Parse the current token (and related children) according to a desired schema - * This is a wrapper for the method `convertField()` to handle a row wrapped - * with an array. + * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is a wrapper for the method + * `makeConverter()` to handle a row wrapped with an array. 
*/ - def convertRootField( - factory: JsonFactory, - parser: JsonParser, - schema: DataType): Any = { -import com.fasterxml.jackson.core.JsonToken._ -(parser.getCurrentToken, schema) match { - case (START_ARRAY, st: StructType) => -// SPARK-3308: support reading top level JSON arrays and take every element -// in such an array as a row -convertArray(factory, parser, st) - - case (START_OBJECT, ArrayType(st, _)) => -// the business end of SPARK-3308: -// when an object is found but an array is requested just wrap it in a list -convertField(factory, parser, st) :: Nil + def makeRootConverter(dataType: DataType): ValueConverter = dataType match { +case st: StructType => + // SPARK-3308: support reading top level JSON arrays and take every element + // in such an array as a row + val elementConverter = makeConverter(st) + val fieldConverters = st.map(_.dataType).map(makeConverter) + (parser: JsonParser) => parser.getCurrentToken match { +case START_OBJECT => convertObject(parser, st, fieldConverters) +case START_ARRAY => convertArray(parser, elementConverter) +case _ => failedConversion(parser, st) + } - case _ => -convertField(factory, parser, schema) -} +case ArrayType(st: StructType, _) => + // the business end of SPARK-3308: + // when an object is found but an array is requested just wrap it in a list + val elementConverter = makeConverter(st) + val fieldConverters = st.map(_.dataType
[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14235 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62446/ Test PASSed.
[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14235 Merged build finished. Test PASSed.
[GitHub] spark issue #14235: [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14235 **[Test build #62446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62446/consoleFull)** for PR 14235 at commit [`e2a7ac4`](https://github.com/apache/spark/commit/e2a7ac445e7f0288d384996dc0e36195e5849c5a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14102#discussion_r71099344 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala --- @@ -35,184 +34,306 @@ import org.apache.spark.util.Utils private[json] class SparkSQLJsonProcessingException(msg: String) extends RuntimeException(msg) -object JacksonParser extends Logging { +private[sql] class JacksonParser(schema: StructType, options: JSONOptions) extends Logging { + import com.fasterxml.jackson.core.JsonToken._ - def parse( - input: RDD[String], - schema: StructType, - columnNameOfCorruptRecords: String, - configOptions: JSONOptions): RDD[InternalRow] = { + // A `ValueConverter` is responsible for converting a value from `JsonParser` + // to a value in a field for `InternalRow`. + private type ValueConverter = (JsonParser) => Any -input.mapPartitions { iter => - parseJson(iter, schema, columnNameOfCorruptRecords, configOptions) + // `ValueConverter`s for the root schema for all fields in the schema + private val rootConverter: ValueConverter = makeRootConverter(schema) + + private val factory = new JsonFactory() + options.setJacksonOptions(factory) + + private def failedConversion( + parser: JsonParser, + dataType: DataType): Any = parser.getCurrentToken match { +case null | VALUE_NULL => + null + +case _ if parser.getTextLength < 1 => + // guard the non string type + null --- End diff -- These two cases are the only ones that do not depend on the `DataType`; they were previously [here](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L83-L84) and [here](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L93-L95). So I put them here because every `ValueConverter` should handle this case.
I can move them into the `ValueConverter`s if that does not look good.
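The converter-per-field idea discussed above can be sketched without the Jackson dependency. This is a simplified illustration with made-up types (not Spark's actual classes): the key point is that the type dispatch happens once when the converters are built, not once per record.

```scala
// Simplified sketch (assumed types, not Spark's) of the converter-per-field
// pattern: build one conversion closure per schema field up front, so that
// per-record parsing only applies closures and never re-inspects the type.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType

type ValueConverter = String => Any

// Type dispatch happens here, exactly once per field.
def makeConverter(dataType: DataType): ValueConverter = dataType match {
  case IntType    => raw => raw.trim.toInt
  case StringType => raw => raw
}

// Build all field converters once, then reuse them for every record.
def makeRowConverter(schema: Seq[DataType]): Seq[String] => Seq[Any] = {
  val converters = schema.map(makeConverter)
  tokens => converters.zip(tokens).map { case (convert, token) => convert(token) }
}
```

A `failedConversion`-style fallback would then be a shared guard that each built converter delegates to, which is why the two type-independent cases above sit outside the individual converters.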
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71099153 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") +val tid = if (names.length > 1) { + TableIdentifier(names(1), Some(names(0))) +} else { + TableIdentifier(param, None) +} +try { + catalog.lookupRelation(tid) + + var stop = false + resolvedChild = resolvedChild.transformDown { +case r @ BroadcastHint(SubqueryAlias(t, _)) + if !stop && resolver(t, tid.identifier) => + stop = true + r +case r @ SubqueryAlias(t, _) if !stop && resolver(t, tid.identifier) => + stop = true + BroadcastHint(r) --- End diff -- When a `BroadcastHint` crosses a `SubqueryAlias`, we need to remove the qualifier.
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71099090 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) +case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) => --- End diff -- When moving a `Hint` across a `SubqueryAlias`, we should consider adding the qualifier back.
[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14102#discussion_r71099098 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONOptions.scala --- @@ -51,7 +53,8 @@ private[sql] class JSONOptions( parameters.get("allowBackslashEscapingAnyCharacter").map(_.toBoolean).getOrElse(false) val compressionCodec = parameters.get("compression").map(CompressionCodecs.getCodecClassName) private val parseMode = parameters.getOrElse("mode", "PERMISSIVE") - val columnNameOfCorruptRecord = parameters.get("columnNameOfCorruptRecord") + val columnNameOfCorruptRecord = +parameters.getOrElse("columnNameOfCorruptRecord", sqlConf.columnNameOfCorruptRecord) --- End diff -- Oh, actually, the reason was not only to remove duplicated logic. Another reason is that we are passing `columnNameOfCorruptRecords` to `InferSchema.infer(...)` and `JacksonParser.parse(...)` ```scala def infer( json: RDD[String], columnNameOfCorruptRecords: String, configOptions: JSONOptions): StructType = { ... ``` ```scala def parse( input: RDD[String], schema: StructType, columnNameOfCorruptRecords: String, configOptions: JSONOptions): RDD[InternalRow] = { ... ``` even though `columnNameOfCorruptRecords` is already an option in `JSONOptions`. I will revert this change if it still does not sound good, but I wanted to let you know just in case.
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098985 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. 
+case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) +case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case ll @ LocalLimit(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = ll.copy(child = hintChild)) +case f @ Filter(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = f.copy(child = hintChild)) +case a @ Aggregate(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = a.copy(child = hintChild)) +case s @ Sort(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case g @ Generate(_, _, _, _, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = g.copy(child = hintChild)) +// Set operation is not allowed to be across. UNION/INTERSECT/EXCEPT --- End diff -- Is this a restriction? If so, for what reason?
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098966 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. --- End diff -- You do not need to enumerate all the node types; this depends only on the tree structure: `UnaryNode`, `BinaryNode`, and so on.
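The reviewer's suggestion can be illustrated on a toy plan tree (hypothetical classes, not Spark's actual `LogicalPlan` API): a single generic unary-node case replaces the long enumeration of `Filter`/`Sort`/`LocalLimit`/`Aggregate`/... cases in the diff.

```scala
// Toy tree sketch of bubbling a hint node up through arbitrary unary
// operators until a Project is reached. All classes are assumed
// simplifications, not Spark's.
sealed trait Plan
trait UnaryPlan extends Plan {
  def child: Plan
  def withChild(c: Plan): UnaryPlan
}
case class Hint(name: String, child: Plan) extends UnaryPlan {
  def withChild(c: Plan): UnaryPlan = copy(child = c)
}
case class Filter(child: Plan) extends UnaryPlan {
  def withChild(c: Plan): UnaryPlan = copy(child = c)
}
case class Project(child: Plan) extends UnaryPlan {
  def withChild(c: Plan): UnaryPlan = copy(child = c)
}
case class Table(name: String) extends Plan

// One generic UnaryPlan case replaces the per-operator enumeration:
// any unary operator sitting on top of a Hint swaps places with it.
def bubbleUp(plan: Plan): Plan = plan match {
  case p: Project => p.withChild(bubbleUp(p.child))     // stop at Project
  case u: UnaryPlan => bubbleUp(u.child) match {
    case h: Hint => h.withChild(u.withChild(h.child))   // lift hint above u
    case c       => u.withChild(c)
  }
  case other => other
}
```

For example, `Project(Filter(Hint(...)))` normalizes to `Project(Hint(Filter(...)))` without a dedicated `Filter` case; binary operators like `Join` would need one analogous `BinaryPlan` case.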
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098821 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => --- End diff -- Uh, your solution depends on the rule `AddSubquery`.
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14028 Merged build finished. Test PASSed.
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14028 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62445/ Test PASSed.
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14028 **[Test build #62445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62445/consoleFull)** for PR 14028 at commit [`6570a98`](https://github.com/apache/spark/commit/6570a9874e60ecb9366ea37a0e5dfe06b821dc62). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098603 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => --- End diff -- Why can you always assume a `SubqueryAlias` above a `Project`?
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098299 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. --- End diff -- Can you improve this? It is hard to maintain. For example, why is `LocalLimit` included but `GlobalLimit` is not?
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098109 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") --- End diff -- Oh, you mean the quoted style. That's a different story. Let me check how that works in Hive.
[GitHub] spark issue #14244: [SPARK-16598] [SQL] [TEST] Added a test case for verifyi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14244 **[Test build #62447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62447/consoleFull)** for PR 14244 at commit [`871d358`](https://github.com/apache/spark/commit/871d358b3b24b5b2acf46f5d882f4b8614fdd670).
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098026 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") --- End diff -- I intentionally removed that from this PR; you cannot get everything into a single PR. Since you know Oracle's global hints (you already asked to support view_name.table_name), you can do better than me there.
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71098009 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -87,6 +87,7 @@ class Analyzer( EliminateUnions), Batch("Resolution", fixedPoint, ResolveRelations :: + SubstituteHints :: --- End diff -- Have you tried moving it after the `Resolution` batch? Did you hit any errors?
[GitHub] spark pull request #13988: [SPARK-16101][SQL] Refactoring CSV data source to...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13988#discussion_r71097989 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.csv + +import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings} + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.StructType + +/** + * Converts a sequence of string to CSV string + */ +private[csv] object UnivocityGenerator extends Logging { + /** + * Transforms a single InternalRow to CSV using Univocity + * + * @param rowSchema the schema object used for conversion + * @param writer a CsvWriter object + * @param headers headers to write + * @param writeHeader true if it needs to write header + * @param options CSVOptions object containing options + * @param row The row to convert + */ + def apply( + rowSchema: StructType, + writer: CsvWriter, + headers: Array[String], + writeHeader: Boolean, + options: CSVOptions)(row: InternalRow): Unit = { +val tokens = { + row.toSeq(rowSchema).map { field => +// TODO: It seems all the data types are not able to be represented by `toString`. +// For example, `DateType` and `TimestampType` are being long values as timestamps. --- End diff -- Actually, I opened another PR here, https://github.com/apache/spark/pull/13912. Maybe it is relevant to this PR.
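For context on the TODO in the diff: Spark stores a `DateType` value internally as an `Int` counting days since the Unix epoch (and a `TimestampType` value as microseconds), so a raw `toString` of the internal value yields a number rather than a date literal. A minimal sketch of the kind of conversion the writer needs (standalone illustration, not Spark's actual `DateTimeUtils`):

```scala
import java.time.LocalDate

// DateType's internal value counts days since 1970-01-01; writing it out
// as CSV requires an explicit conversion, not just toString.
def internalDateToString(daysSinceEpoch: Int): String =
  LocalDate.ofEpochDay(daysSinceEpoch.toLong).toString
```

With this, internal day 0 renders as `1970-01-01` instead of the bare integer `0` that `toString` on the internal value would produce.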
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71097984 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") --- End diff -- Yep. You can try that in another PR. :)
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71097965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") --- End diff -- See the PR: https://github.com/apache/spark/pull/14244. I think we should cover all the cases in `TableIdentifierParserSuite.scala` if you do not want to parse it in the `Analyzer`.
[GitHub] spark pull request #14244: [SPARK-16598] [SQL] [TEST] Added a test case for ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/14244

[SPARK-16598] [SQL] [TEST] Added a test case for verifying the table identifier parsing

What changes were proposed in this pull request?

So far, the test cases of `TableIdentifierParserSuite` do not cover the quoted cases. We should add one to avoid regressions.

How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark quotedIdentifiers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14244.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14244

commit 871d358b3b24b5b2acf46f5d882f4b8614fdd670
Author: gatorsmile
Date: 2016-07-18T03:29:56Z

    added a test case
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71097771

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

```diff
@@ -1774,6 +1775,51 @@ class Analyzer(
   }

   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
```

--- End diff --

Let me try the quoted identifiers in the parser.
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71097752

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

```diff
@@ -1774,6 +1775,51 @@ class Analyzer(
   }

   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
```

--- End diff --

How about Oracle?
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14174

hey @sameeragarwal, the diff of the generated code can be found here (with the left side for the vectorized hashmap and the right side for the new, row-based one): https://gist.github.com/ooq/d25d4de4c445752c6e1a810262c2e5c1/revisions?diff=split

Below is the code I used to generate them:

```
test("generated code comparison for vectorized vs. rowbased") {
  val N = 20 << 23
  sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
  sparkSession.conf.set("spark.sql.codegen.aggregate.map.columns.max", "30")
  sparkSession.range(N)
    .selectExpr(
      "id & 1023 as k1",
      "cast (id & 1023 as string) as k2")
    .createOrReplaceTempView("test")
  // dataframe/query
  val query = sparkSession.sql("select count(k1), sum(k1) from test group by k1, k2")
  // vectorized
  sparkSession.conf.set("spark.sql.codegen.aggregate.map.enforce.impl", "vectorized")
  query.queryExecution.debug.codegen()
  // row based
  sparkSession.conf.set("spark.sql.codegen.aggregate.map.enforce.impl", "rowbased")
  query.queryExecution.debug.codegen()
}
```
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71097403

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

```diff
@@ -87,6 +87,7 @@ class Analyzer(
       EliminateUnions),
     Batch("Resolution", fixedPoint,
       ResolveRelations ::
+      SubstituteHints ::
```

--- End diff --

That's a good point. What is the best policy for placing a `Rule`: in a `fixedPoint` batch or a `Once` batch? To minimize the side effect of `Hint` on other rules, I put it right after `ResolveRelations`.
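The `Once` vs `fixedPoint` distinction discussed here can be sketched with a toy rule executor (hypothetical names; Spark's real `RuleExecutor` lives in `org.apache.spark.sql.catalyst.rules`): a `FixedPoint` batch reruns its rules until the plan stops changing or an iteration budget runs out, while a `Once` batch fires a single time.

```scala
object ToyRuleExecutor {
  type Rule[T] = T => T

  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations = 1 }
  case class FixedPoint(maxIterations: Int) extends Strategy

  case class Batch[T](name: String, strategy: Strategy, rules: Seq[Rule[T]])

  // Apply each batch's rules repeatedly until the plan stops changing
  // or the strategy's iteration budget is exhausted.
  def execute[T](plan: T, batches: Seq[Batch[T]]): T =
    batches.foldLeft(plan) { (p, batch) =>
      var current = p
      var iteration = 0
      var continue = true
      while (continue && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(current)((acc, rule) => rule(acc))
        continue = next != current
        current = next
        iteration += 1
      }
      current
    }

  def main(args: Array[String]): Unit = {
    // A toy "rule" that caps a number at 10, run to a fixed point.
    val rule: Rule[Int] = n => math.min(n + 3, 10)
    assert(execute[Int](0, Seq(Batch("inc", FixedPoint(100), Seq(rule)))) == 10)
    // Under Once, the same rule fires a single time.
    assert(execute[Int](0, Seq(Batch("inc", Once, Seq(rule)))) == 3)
    println("ok")
  }
}
```

A rule that only needs to fire once per hint is still safe in a `fixedPoint` batch as long as it is idempotent, which is one reason ordering it right after `ResolveRelations` matters here.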
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71097256

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

```diff
@@ -1774,6 +1775,51 @@ class Analyzer(
   }

   /**
+   * Substitute Hints.
+   * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
+   */
+  object SubstituteHints extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case logical: LogicalPlan => logical transformDown {
+        case h @ Hint(name, parameters, child)
+            if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) =>
+          var resolvedChild = child
+
+          for (param <- parameters) {
+            val names = param.split("\\.")
```

--- End diff --

Currently, the parser does not allow specifying a database name. (I didn't commit the code since I found that Hive doesn't allow it either.) If the parser did accept it, it would be valid in all cases because the parser accepts only identifiers.
[GitHub] spark pull request #14102: [SPARK-16434][SQL] Avoid per-record type dispatch...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14102#discussion_r71097210

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala ---

```diff
@@ -35,184 +34,306 @@ import org.apache.spark.util.Utils

 private[json] class SparkSQLJsonProcessingException(msg: String) extends RuntimeException(msg)

-object JacksonParser extends Logging {
+private[sql] class JacksonParser(schema: StructType, options: JSONOptions) extends Logging {
+  import com.fasterxml.jackson.core.JsonToken._

-  def parse(
-      input: RDD[String],
-      schema: StructType,
-      columnNameOfCorruptRecords: String,
-      configOptions: JSONOptions): RDD[InternalRow] = {
+  // A `ValueConverter` is responsible for converting a value from `JsonParser`
+  // to a value in a field for `InternalRow`.
+  private type ValueConverter = (JsonParser) => Any

-    input.mapPartitions { iter =>
-      parseJson(iter, schema, columnNameOfCorruptRecords, configOptions)
+  // `ValueConverter`s for the root schema for all fields in the schema
+  private val rootConverter: ValueConverter = makeRootConverter(schema)
+
+  private val factory = new JsonFactory()
+  options.setJacksonOptions(factory)
+
+  private def failedConversion(
+      parser: JsonParser,
+      dataType: DataType): Any = parser.getCurrentToken match {
+    case null | VALUE_NULL =>
+      null
+
+    case _ if parser.getTextLength < 1 =>
+      // guard the non string type
+      null
+
+    case token =>
+      // We cannot parse this token based on the given data type. So, we throw a
+      // SparkSQLJsonProcessingException and this exception will be caught by
+      // parseJson method.
+      throw new SparkSQLJsonProcessingException(
+        s"Failed to parse a value for data type $dataType (current token: $token).")
+  }
+
+  private def failedRecord(record: String): Seq[InternalRow] = {
+    // create a row even if no corrupt record column is present
+    if (options.failFast) {
+      throw new RuntimeException(s"Malformed line in FAILFAST mode: $record")
+    }
+    if (options.dropMalformed) {
+      logWarning(s"Dropping malformed line: $record")
+      Nil
+    } else {
+      val row = new GenericMutableRow(schema.length)
+      for (corruptIndex <- schema.getFieldIndex(options.columnNameOfCorruptRecord)) {
+        require(schema(corruptIndex).dataType == StringType)
+        row.update(corruptIndex, UTF8String.fromString(record))
+      }
+      Seq(row)
     }
   }

   /**
-   * Parse the current token (and related children) according to a desired schema
-   * This is a wrapper for the method `convertField()` to handle a row wrapped
-   * with an array.
+   * Create a converter which converts the JSON documents held by the `JsonParser`
+   * to a value according to a desired schema. This is a wrapper for the method
+   * `makeConverter()` to handle a row wrapped with an array.
    */
-  def convertRootField(
-      factory: JsonFactory,
-      parser: JsonParser,
-      schema: DataType): Any = {
-    import com.fasterxml.jackson.core.JsonToken._
-    (parser.getCurrentToken, schema) match {
-      case (START_ARRAY, st: StructType) =>
-        // SPARK-3308: support reading top level JSON arrays and take every element
-        // in such an array as a row
-        convertArray(factory, parser, st)
-
-      case (START_OBJECT, ArrayType(st, _)) =>
-        // the business end of SPARK-3308:
-        // when an object is found but an array is requested just wrap it in a list
-        convertField(factory, parser, st) :: Nil
+  def makeRootConverter(dataType: DataType): ValueConverter = dataType match {
+    case st: StructType =>
+      // SPARK-3308: support reading top level JSON arrays and take every element
+      // in such an array as a row
+      val elementConverter = makeConverter(st)
+      val fieldConverters = st.map(_.dataType).map(makeConverter)
+      (parser: JsonParser) => parser.getCurrentToken match {
+        case START_OBJECT => convertObject(parser, st, fieldConverters)
+        case START_ARRAY => convertArray(parser, elementConverter)
+        case _ => failedConversion(parser, st)
+      }

-      case _ =>
-        convertField(factory, parser, schema)
-    }
+    case ArrayType(st: StructType, _) =>
+      // the business end of SPARK-3308:
+      // when an object is found but an array is requested just wrap it in a list
+      val elementConverter = makeConverter(st)
+      val fieldConverters = st.map(_.dataType).map(
```
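The refactoring quoted above (the diff is truncated in this message) replaces per-record pattern matching on the data type with converter closures built once from the schema; each record then only applies the precomputed closures. A minimal, self-contained sketch of that idea with toy types (not Spark's actual classes):

```scala
object ConverterSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  case class StructType(fields: Seq[DataType]) extends DataType

  type ValueConverter = String => Any

  // Type dispatch happens here, once per schema, not once per record.
  def makeConverter(dataType: DataType): ValueConverter = dataType match {
    case IntType    => (s: String) => s.toInt
    case StringType => (s: String) => s
    case StructType(fields) =>
      val fieldConverters = fields.map(makeConverter)
      (s: String) => {
        // Per-record work is just zipping tokens with prebuilt closures.
        val tokens = s.split(",")
        tokens.zip(fieldConverters).map { case (t, c) => c(t) }.toSeq
      }
  }

  def main(args: Array[String]): Unit = {
    val convert = makeConverter(StructType(Seq(IntType, StringType)))
    assert(convert("42,hello") == Seq(42, "hello"))
    println("ok")
  }
}
```

The same schema can now convert millions of records without re-matching on `DataType` for every field of every row, which is the point of SPARK-16434's "avoid per-record type dispatch".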