[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290983004 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ## @@ -36,17 +36,18 @@ import org.apache.spark.sql.util.SchemaUtils * Replaces [[UnresolvedRelation]]s if the plan is for direct query on files. */ class ResolveSQLOnFile(sparkSession: SparkSession) extends Rule[LogicalPlan] { + private def maybeSQLFile(u: UnresolvedRelation): Boolean = { -sparkSession.sessionState.conf.runSQLonFile && u.tableIdentifier.database.isDefined +sparkSession.sessionState.conf.runSQLonFile && u.multipartIdentifier.size == 2 Review comment: Sounds good. What should be the expected behavior for this test in sql.SQLQuerySuite."run sql directly on files"? ``` e = intercept[AnalysisException] { sql("select * from json.invalid_file") } assert(e.message.contains("Path does not exist")) ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290902457 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala ## @@ -98,8 +100,22 @@ case class DataSourceResolution( case DropViewStatement(AsTableIdentifier(tableName), ifExists) => DropTableCommand(tableName, ifExists, isView = true, purge = false) + +case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalog), ident)) => Review comment: When attempting to move `DataSourceV2Relation` to Catalyst, I discovered [SPARK-27961](https://issues.apache.org/jira/browse/SPARK-27961) "DataSourceV2Relation should not have refresh method". The refresh method was added by #24401. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290902457 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala ## @@ -98,8 +100,22 @@ case class DataSourceResolution( case DropViewStatement(AsTableIdentifier(tableName), ifExists) => DropTableCommand(tableName, ifExists, isView = true, purge = false) + +case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalog), ident)) => Review comment: When attempting to move `DataSourceV2Relation` to Catalyst, I discovered [SPARK-27961](https://issues.apache.org/jira/browse/SPARK-27961) "DataSourceV2Relation should not have refresh method". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290595780 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -728,20 +737,16 @@ class Analyzer( //and the default database is only used to look up a view); // 3. Use the currentDb of the SessionCatalog. private def lookupTableFromCatalog( +tableIdentifier: TableIdentifier, u: UnresolvedRelation, defaultDatabase: Option[String] = None): LogicalPlan = { - val tableIdentWithDb = u.tableIdentifier.copy( -database = u.tableIdentifier.database.orElse(defaultDatabase)) + val tableIdentWithDb = tableIdentifier.copy( +database = tableIdentifier.database.orElse(defaultDatabase)) try { catalog.lookupRelation(tableIdentWithDb) } catch { -case e: NoSuchTableException => - u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}", e) -// If the database is defined and that database is not found, throw an AnalysisException. -// Note that if the database is not defined, it is possible we are looking up a temp view. -case e: NoSuchDatabaseException => - u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}, the " + -s"database ${e.db} doesn't exist.", e) Review comment: Unfortunately not possible since Analysis exception is thrown by checkAnalysis now: ``` case u: UnresolvedRelation => u.failAnalysis(s"Table or view not found: ${u.multipartIdentifier.quoted}") ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290592441 ## File path: sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala ## @@ -185,6 +186,8 @@ abstract class BaseSessionStateBuilder( V2WriteSupportCheck +: V2StreamingScanSupportCheck +: customCheckRules + +override protected def lookupCatalog(name: String): CatalogPlugin = session.catalog(name) Review comment: Good questions. Please check out @rdblue's #24768 first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290591930 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala ## @@ -98,8 +100,22 @@ case class DataSourceResolution( case DropViewStatement(AsTableIdentifier(tableName), ifExists) => DropTableCommand(tableName, ifExists, isView = true, purge = false) + +case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalog), ident)) => Review comment: Thanks for the suggestion. Let me give it some thought. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290591148 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala ## @@ -181,6 +182,11 @@ case class CreateViewCommand( * Permanent views are not allowed to reference temp objects, including temp function and views */ private def verifyTemporaryObjectsNotExists(sparkSession: SparkSession): Unit = { +val lookup = new LookupCatalog { + override protected def lookupCatalog(name: String): CatalogPlugin = sparkSession.catalog(name) +} +import lookup._ Review comment: Good idea! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290583573 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ## @@ -153,7 +153,8 @@ object HiveAnalysis extends Rule[LogicalPlan] { case CreateTable(tableDesc, mode, None) if DDLUtils.isHiveTable(tableDesc) => CreateTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore) -case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => +case CreateTable(tableDesc, mode, Some(query)) +if DDLUtils.isHiveTable(tableDesc) && query.resolved => Review comment: This PR prevents lookupTableFromCatalog from throwing NoSuchTableException right away. Instead, it relies on checkAnalysis to throw an exception for UnresolvedRelation. The test hive.SQLQuerySuite."double nested data" would fail in the following sql without this change: ``` CREATE TABLE test_ctas_1234 AS SELECT * from notexists ``` HiveAnalysis gets to run before checkAnalysis, thus exposing this bug where query.output is used before query is resolved. So wouldn't say it is a totally separate issue. In addition, outside of this PR, it'd be hard to write a unit test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290550798 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala ## @@ -91,6 +92,8 @@ class HiveSessionStateBuilder(session: SparkSession, parentState: Option[Session V2WriteSupportCheck +: V2StreamingScanSupportCheck +: customCheckRules + +override protected def lookupCatalog(name: String): CatalogPlugin = session.catalog(name) Review comment: Same answer as above. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290550754 ## File path: sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala ## @@ -185,6 +186,8 @@ abstract class BaseSessionStateBuilder( V2WriteSupportCheck +: V2StreamingScanSupportCheck +: customCheckRules + +override protected def lookupCatalog(name: String): CatalogPlugin = session.catalog(name) Review comment: SparkSession.catalog returns CatalogPlugin for DSv2 while BaseSessionStateBuilder.catalog is a SessionCatalog. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290550116 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala ## @@ -98,8 +100,22 @@ case class DataSourceResolution( case DropViewStatement(AsTableIdentifier(tableName), ifExists) => DropTableCommand(tableName, ifExists, isView = true, purge = false) + +case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalog), ident)) => Review comment: All DSv2 resolutions are kept in DataSourceResolution that requires LookupCatalog mixin. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290522087 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ## @@ -38,19 +38,32 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str /** * Holds the name of a relation that has yet to be looked up in a catalog. * - * @param tableIdentifier table name + * @param multipartIdentifier table name */ -case class UnresolvedRelation(tableIdentifier: TableIdentifier) - extends LeafNode { +case class UnresolvedRelation(multipartIdentifier: Seq[String]) extends LeafNode { + import org.apache.spark.sql.catalog.v2.CatalogV2Implicits._ /** Returns a `.` separated name for this relation. */ - def tableName: String = tableIdentifier.unquotedString + def tableName: String = multipartIdentifier.quoted override def output: Seq[Attribute] = Nil override lazy val resolved = false } +object UnresolvedRelation { + def apply(tableIdentifier: TableIdentifier): UnresolvedRelation = { Review comment: I am working on a PR for SPARK-27845 to add Insert. Almost done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290383294 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala ## @@ -50,11 +51,15 @@ object ResolveHints { * * This rule must happen before common table expressions. */ - class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan] { + class ResolveJoinStrategyHints( + conf: SQLConf, + catalogLookup: String => CatalogPlugin) extends Rule[LogicalPlan] with LookupCatalog { Review comment: Like suggested approach of passing LookupCatalog instance because it will make it easier to add feature to LookCatalog e.g. default catalog name. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290379810 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala ## @@ -50,11 +51,15 @@ object ResolveHints { * * This rule must happen before common table expressions. */ - class ResolveJoinStrategyHints(conf: SQLConf) extends Rule[LogicalPlan] { + class ResolveJoinStrategyHints( + conf: SQLConf, + catalogLookup: String => CatalogPlugin) extends Rule[LogicalPlan] with LookupCatalog { Review comment: Not used. Removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs
jzhuge commented on a change in pull request #24741: [SPARK-27322][SQL] DataSourceV2: Select from multiple catalogs URL: https://github.com/apache/spark/pull/24741#discussion_r290379652 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ## @@ -35,18 +36,21 @@ import org.apache.spark.sql.util.SchemaUtils /** * Replaces [[UnresolvedRelation]]s if the plan is for direct query on files. */ -class ResolveSQLOnFile(sparkSession: SparkSession) extends Rule[LogicalPlan] { +class ResolveSQLOnFile(sparkSession: SparkSession) extends Rule[LogicalPlan] with LookupCatalog { + + override protected def lookupCatalog(name: String): CatalogPlugin = sparkSession.catalog(name) Review comment: Not used. Removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org