[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r367114890

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -809,22 +809,27 @@ class Analyzer(
           .getOrElse(i)

       case desc @ DescribeTable(u: UnresolvedV2Relation, _) =>
-        CatalogV2Util.loadRelation(u.catalog, u.tableName)
-          .map(rel => desc.copy(table = rel))
-          .getOrElse(desc)
+        CatalogV2Util
+          .loadRelation(u.catalog, catalogManager.catalogIdentifier(u.catalog), u.tableName)

Review comment: I think it's fine to reproduce the full identifier that was actually loaded by adding in the catalog and namespace if they are missing. But I'm not sure it is a good idea to rely on this behavior, since there is no guarantee.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
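The behavior discussed in this comment, completing a user-supplied multi-part name with the current catalog and namespace when they are missing, can be sketched roughly as follows. This is a simplified illustration, not Spark's actual resolution code; `FullIdentifier`, `qualify`, and all parameter names are hypothetical.

```scala
// Hypothetical, simplified sketch of name completion; not Spark's actual code.
case class FullIdentifier(catalog: String, namespace: Seq[String], name: String)

// Fill in the catalog and namespace when the user-supplied name omits them.
// `knownCatalogs` stands in for the set of registered catalog names.
def qualify(
    parts: Seq[String],
    knownCatalogs: Set[String],
    currentCatalog: String,
    currentNamespace: Seq[String]): FullIdentifier =
  parts match {
    // A leading part that names a registered catalog is taken as the catalog.
    case head +: rest if rest.nonEmpty && knownCatalogs(head) =>
      FullIdentifier(head, rest.init, rest.last)
    // A bare table name inherits both the current catalog and namespace.
    case Seq(name) =>
      FullIdentifier(currentCatalog, currentNamespace, name)
    // Otherwise only the catalog is missing.
    case _ =>
      FullIdentifier(currentCatalog, parts.init, parts.last)
  }
```

The reviewer's caveat applies here too: the completed identifier reflects resolution defaults at analysis time, so callers should not assume it is stable across sessions.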
[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r367071087

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

@@ -32,12 +32,19 @@ import org.apache.spark.util.Utils
  * A logical plan representing a data source v2 table.
  *
  * @param table The table that this relation represents.
+ * @param output the output attributes of this relation
+ * @param catalogIdentifier the string identifier for the catalog. None if no catalog is specified
+ * @param identifiers the identifiers for the v2 relation. For multipath dataframe, there could be
+ *                    more than one identifier or Nil if a V2 relation is instantiated using
+ *                    options
  * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]]
  *                and [[WriteBuilder]].
  */
 case class DataSourceV2Relation(
     table: Table,
     output: Seq[AttributeReference],
+    catalogIdentifier: Option[String],
+    identifiers: Seq[Identifier],

Review comment: Two things:
1. There should be only one identifier for a table, because the identifier is what identified the table.
2. Let's use null for `load(paths: String*)`, because path-based tables are not designed or supported in v2.
[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r367070270

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala

@@ -44,15 +44,27 @@ class CatalogManager(
   import CatalogManager.SESSION_CATALOG_NAME

   private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]
+  // Map from catalog back to it's original name for easy name look up, we don't use the
+  // CatalogPlugin's name as it might be different from the catalog name depending on
+  // implementation.
+  private val catalogIdentifiers = mutable.HashMap.empty[CatalogPlugin, String]

Review comment: The plugin name is intended to prevent needing to do this. While we do rely on the catalog not to report the wrong name, I think it is reasonable to use it. I'm not strongly against this, though; if you think this is cleaner, we can do that.
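The two options weighed in this exchange can be sketched side by side. This is a simplified stand-in, not Spark's actual `CatalogManager` or `CatalogPlugin`; the method names `nameOf` and `registeredNameOf` are hypothetical.

```scala
// Simplified sketch (not Spark's actual classes) contrasting the two lookups
// discussed above: trusting CatalogPlugin.name versus a reverse lookup over
// the registration map.
trait CatalogPlugin {
  def name: String
}

class CatalogManager {
  private val catalogs = scala.collection.mutable.HashMap.empty[String, CatalogPlugin]

  def register(registeredName: String, plugin: CatalogPlugin): Unit =
    catalogs(registeredName) = plugin

  // Option A: trust CatalogPlugin.name, as the reviewer suggests.
  def nameOf(plugin: CatalogPlugin): String = plugin.name

  // Option B: reverse lookup over the registration map, which stays correct
  // even if a plugin reports a name different from the one it was registered under.
  def registeredNameOf(plugin: CatalogPlugin): Option[String] =
    catalogs.collectFirst { case (n, p) if p eq plugin => n }
}
```

The sketch shows the trade-off: option A is simpler but relies on the plugin reporting its registered name; option B is robust to a mismatched `name` at the cost of an extra map (or a scan, as here).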
[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r366634963

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -915,7 +924,11 @@ class Analyzer(
           AnalysisContext.get.relationCache.getOrElseUpdate(
             key, v1SessionCatalog.getRelation(v1Table.v1Table))

         case table =>
-          DataSourceV2Relation.create(table)
+          DataSourceV2Relation.create(
+            table,
+            catalogManager.catalogIdentifier(catalog),

Review comment: Why are we using "identifier" for the catalog name? Everywhere else we call it the catalog name, so I don't see a reason to make this more complicated.
[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r366634590

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala

@@ -44,15 +44,27 @@ class CatalogManager(
   import CatalogManager.SESSION_CATALOG_NAME

   private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]
+  // Map from catalog back to it's original name for easy name look up, we don't use the
+  // CatalogPlugin's name as it might be different from the catalog name depending on
+  // implementation.
+  private val catalogIdentifiers = mutable.HashMap.empty[CatalogPlugin, String]

Review comment: Why is the reverse map needed? Can't we just call `CatalogPlugin.name`?
[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r366634300

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

@@ -32,12 +32,19 @@ import org.apache.spark.util.Utils
  * A logical plan representing a data source v2 table.
  *
  * @param table The table that this relation represents.
+ * @param output the output attributes of this relation
+ * @param catalogIdentifier the string identifier for the catalog. None if no catalog is specified
+ * @param identifiers the identifiers for the v2 relation. For multipath dataframe, there could be
+ *                    more than one identifier or Nil if a V2 relation is instantiated using
+ *                    options
  * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]]
  *                and [[WriteBuilder]].
  */
 case class DataSourceV2Relation(
     table: Table,
     output: Seq[AttributeReference],
+    catalogIdentifier: Option[String],
+    identifiers: Seq[Identifier],

Review comment: I don't think it is a good idea to have multiple identifiers here. DSv2 doesn't yet cover how file-based tables should work, and I think we need a design document for them. Adding multiple identifiers here in support of something that has undefined behavior seems premature.

Design and behavior of path-based identifiers aside, a table should use one and only one identifier. When path-based tables are supported, I expect them to use a single `Identifier` with possibly more than one path embedded in it, like we do with the `paths` key.
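The single-identifier design the reviewer describes can be sketched as below. This is a simplified stand-in, not Spark's `DataSourceV2Relation`; the `Relation` class and the `pathBasedRelation` helper are hypothetical, while the idea of folding multiple paths into one `paths` entry comes directly from the comment.

```scala
// Simplified stand-in for the relation discussed above; not Spark's actual classes.
case class Identifier(namespace: Seq[String], name: String)

case class Relation(
    table: String,
    // At most one identifier: it is what identified the table.
    identifier: Option[Identifier],
    options: Map[String, String])

// A multi-path load still yields a single relation with no catalog identifier;
// the paths travel together in the options (here under a "paths" key, as the
// comment suggests), rather than as a list of identifiers on the relation.
def pathBasedRelation(table: String, paths: Seq[String]): Relation =
  Relation(table, identifier = None, options = Map("paths" -> paths.mkString(",")))
```

This keeps the relation's shape uniform: catalog-resolved tables carry `Some(identifier)`, while option- or path-based instantiation carries `None` plus options, instead of a variable-length identifier list.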