[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation

2020-01-15 Thread GitBox
rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r367114890
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ##
 @@ -809,22 +809,27 @@ class Analyzer(
   .getOrElse(i)
 
   case desc @ DescribeTable(u: UnresolvedV2Relation, _) =>
-CatalogV2Util.loadRelation(u.catalog, u.tableName)
-.map(rel => desc.copy(table = rel))
-.getOrElse(desc)
+CatalogV2Util
+  .loadRelation(u.catalog, catalogManager.catalogIdentifier(u.catalog), u.tableName)
 
 Review comment:
   I think it's fine to reproduce the full identifier that was actually loaded 
by adding in catalog and namespace if they are missing. But I'm not sure it is 
a good idea to rely on this behavior since there is no guarantee.
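The fill-in-the-defaults behavior the comment describes, reconstructing the fully qualified identifier by adding the current catalog and namespace when the user-supplied name omits them, can be sketched roughly as follows. The types and the `qualify` helper are hypothetical stand-ins, not Spark's actual resolution code:

```scala
// Hypothetical sketch, not Spark's actual API: reproduce a fully qualified
// identifier by filling in the session's current catalog and namespace
// when the user-supplied name parts omit them.
case class QualifiedIdent(catalog: String, namespace: Seq[String], name: String)

def qualify(
    parts: Seq[String],           // user-supplied name parts, e.g. Seq("t") or Seq("cat", "db", "t")
    currentCatalog: String,       // default catalog from the session
    currentNamespace: Seq[String] // default namespace from the session
): QualifiedIdent = parts match {
  case Seq(name) =>
    // only a table name: both catalog and namespace are defaulted
    QualifiedIdent(currentCatalog, currentNamespace, name)
  case Seq(ns, name) =>
    // namespace given, catalog defaulted
    QualifiedIdent(currentCatalog, Seq(ns), name)
  case cat +: rest =>
    // fully qualified: catalog, namespace parts..., table name
    QualifiedIdent(cat, rest.init, rest.last)
}
```

As the comment notes, nothing guarantees that the identifier reconstructed this way matches what the catalog actually loaded, which is why relying on it is questionable.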


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation

2020-01-15 Thread GitBox
rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r367071087
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ##
 @@ -32,12 +32,19 @@ import org.apache.spark.util.Utils
  * A logical plan representing a data source v2 table.
  *
  * @param table   The table that this relation represents.
+ * @param output the output attributes of this relation
+ * @param catalogIdentifier the string identifier for the catalog. None if no catalog is specified
+ * @param identifiers the identifiers for the v2 relation. For multipath dataframe, there could be
+ *                    more than one identifier, or Nil if a V2 relation is instantiated using options
 * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]]
 *                and [[WriteBuilder]].
  */
 case class DataSourceV2Relation(
 table: Table,
 output: Seq[AttributeReference],
+catalogIdentifier: Option[String],
+identifiers: Seq[Identifier],
 
 Review comment:
   Two things:
  1. There should be only one identifier for a table, because the identifier is what was used to identify the table.
  2. Let's use null for `load(paths: String*)`, because path-based tables are not designed or supported in v2.





[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation

2020-01-15 Thread GitBox
rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r367070270
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala
 ##
 @@ -44,15 +44,27 @@ class CatalogManager(
   import CatalogManager.SESSION_CATALOG_NAME
 
   private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]
+  // Map from catalog back to its original name for easy name lookup; we don't use the
+  // CatalogPlugin's name as it might be different from the catalog name depending on
+  // the implementation.
+  private val catalogIdentifiers = mutable.HashMap.empty[CatalogPlugin, String]
 
 Review comment:
   The plugin name is intended to prevent needing to do this. While we do rely 
on the catalog not to report the wrong name, I think it is reasonable to use 
it. I'm not strongly against this, though. If you think this is cleaner we can 
do that.
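The two lookup strategies being weighed here, trusting the name the plugin reports versus keeping a reverse map keyed by the plugin instance, can be contrasted in a minimal sketch. The `Manager` class and its methods are stand-ins for illustration, not Spark's actual `CatalogManager`:

```scala
import scala.collection.mutable

// Minimal stand-in for Spark's CatalogPlugin: a catalog reports its own name.
trait CatalogPlugin { def name(): String }

class Manager {
  private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]
  // Reverse map: the registered name keyed by plugin instance, so the answer
  // never depends on what the plugin itself reports via name().
  private val catalogIdentifiers = mutable.HashMap.empty[CatalogPlugin, String]

  def register(registeredName: String, plugin: CatalogPlugin): Unit = {
    catalogs(registeredName) = plugin
    catalogIdentifiers(plugin) = registeredName
  }

  // Strategy 1: trust the plugin to report its own name.
  def nameFromPlugin(plugin: CatalogPlugin): String = plugin.name()

  // Strategy 2: look the name up in the reverse map.
  def nameFromMap(plugin: CatalogPlugin): Option[String] = catalogIdentifiers.get(plugin)
}
```

A plugin that reports a name other than the one it was registered under makes the two strategies diverge, which is the trade-off the comment describes: the reverse map is robust against misbehaving plugins, while `name()` avoids the extra bookkeeping.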





[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation

2020-01-14 Thread GitBox
rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r366634963
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ##
 @@ -915,7 +924,11 @@ class Analyzer(
   AnalysisContext.get.relationCache.getOrElseUpdate(
 key, v1SessionCatalog.getRelation(v1Table.v1Table))
 case table =>
-  DataSourceV2Relation.create(table)
+  DataSourceV2Relation.create(
+table,
+catalogManager.catalogIdentifier(catalog),
 
 Review comment:
   Why are we using "identifier" for the catalog name? Everywhere else we call 
it the catalog name, so I don't see a reason to make this more complicated.





[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation

2020-01-14 Thread GitBox
rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r366634590
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala
 ##
 @@ -44,15 +44,27 @@ class CatalogManager(
   import CatalogManager.SESSION_CATALOG_NAME
 
   private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]
+  // Map from catalog back to its original name for easy name lookup; we don't use the
+  // CatalogPlugin's name as it might be different from the catalog name depending on
+  // the implementation.
+  private val catalogIdentifiers = mutable.HashMap.empty[CatalogPlugin, String]
 
 Review comment:
   Why is the reverse map needed? Can't we just call `CatalogPlugin.name`?





[GitHub] [spark] rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation

2020-01-14 Thread GitBox
rdblue commented on a change in pull request #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
URL: https://github.com/apache/spark/pull/26957#discussion_r366634300
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ##
 @@ -32,12 +32,19 @@ import org.apache.spark.util.Utils
  * A logical plan representing a data source v2 table.
  *
  * @param table   The table that this relation represents.
+ * @param output the output attributes of this relation
+ * @param catalogIdentifier the string identifier for the catalog. None if no catalog is specified
+ * @param identifiers the identifiers for the v2 relation. For multipath dataframe, there could be
+ *                    more than one identifier, or Nil if a V2 relation is instantiated using options
 * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]]
 *                and [[WriteBuilder]].
  */
 case class DataSourceV2Relation(
 table: Table,
 output: Seq[AttributeReference],
+catalogIdentifier: Option[String],
+identifiers: Seq[Identifier],
 
 Review comment:
   I don't think it is a good idea to have multiple identifiers here. DSv2 
doesn't yet cover how file-based tables should work and I think we need a 
design document for them. Adding multiple identifiers here in support of 
something that has undefined behavior seems premature.
   
   Design and behavior of path-based identifiers aside, a table should use one 
and only one identifier. When path-based tables are supported, I expect them to 
use a single `Identifier` with possibly more than one path embedded in it, like 
we do with the `paths` key.
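The single-identifier shape the comment anticipates might look roughly like this; the `Identifier` type is a minimal stand-in and the encoding is hypothetical, since the design for path-based tables was still undecided at the time:

```scala
// Hypothetical sketch: one identifier per table, with multiple load paths
// collapsed into a single options entry (similar in spirit to the existing
// "paths" key), rather than one identifier per path.
case class Identifier(namespace: Seq[String], name: String)

// Several paths become one options entry carrying a JSON-style list,
// so the relation still has exactly one identity.
def pathsOption(paths: Seq[String]): Map[String, String] =
  Map("paths" -> paths.map(p => "\"" + p + "\"").mkString("[", ",", "]"))
```

Under this shape, `spark.read.format(...).load("/a", "/b")` would still resolve to a single table identity instead of a `Seq[Identifier]`, which is the invariant the review comment argues for.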

