[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16090





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-02 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90723948
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
--- End diff --

That's a pretty big change, considering how many classes depend on the 
eager behavior of `InMemoryFileIndex`.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90637793
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
--- End diff --

Then can we just make `InMemoryFileIndex` scan the files lazily? If we only need to infer the schema and partition columns, it should not do the scan.
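
A minimal sketch of what that laziness could look like, assuming a simplified index whose listing is deferred until first use (the names here are illustrative, not Spark's actual internals):

```
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical simplification: defer the expensive recursive listing until a
// caller actually asks for files, so that constructing the index for schema
// and partition-column inference alone never touches the filesystem.
class LazyFileIndex(rootPaths: Seq[Path], listLeafFiles: Path => Seq[FileStatus]) {
  // `lazy val` ensures the scan runs at most once, on first access.
  private lazy val cachedLeafFiles: Seq[FileStatus] = rootPaths.flatMap(listLeafFiles)

  def allFiles(): Seq[FileStatus] = cachedLeafFiles
}
```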





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90591997
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
--- End diff --

You do need to pass it in though.

```
val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
    catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
  new CatalogFileIndex(
    sparkSession,
    catalogTable.get,
    catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
} else {
  new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
}
```

Otherwise, this code will perform a full filesystem scan, independently of the other change that prevents `getOrInferFileFormatSchema` from performing a scan as well.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90590350
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
--- End diff --

You don't want to do that, though. `resolveRelation` also does not always scan the filesystem if you pass in a user-defined schema.
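
Roughly, the short-circuit being described looks like this (a sketch only; the helper names are hypothetical):

```
import org.apache.spark.sql.types.StructType

// With a user-specified schema there is nothing to infer, so resolution can
// skip listing the filesystem entirely; only the None case needs a scan.
def resolveSchema(
    userSpecifiedSchema: Option[StructType],
    inferSchemaByListingFiles: () => StructType): StructType =
  userSpecifiedSchema.getOrElse(inferSchemaByListingFiles())
```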





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90583259
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
--- End diff --

I think it's fine to create an `InMemoryFileIndex` for this case, as we 
call `DataSource.resolveRelation` just to infer the schema and partition 
columns.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90582325
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
         className = table.provider.get,
         bucketSpec = table.bucketSpec,
-        options = table.storage.properties ++ pathOption).resolveRelation()
+        options = table.storage.properties ++ pathOption,
+        catalogTable = Some(uncreatedTable)).resolveRelation()
--- End diff --

Otherwise, we will construct an `InMemoryFileIndex`, which scans the filesystem.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90582283
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
--- End diff --

You also need to pass `catalogTable` in, so that on line 390 of `DataSource` we create a `CatalogFileIndex` instead of an `InMemoryFileIndex`.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90581994
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
--- End diff --

I think this is true for all new tables. If the table is unpartitioned, the flag is harmless.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90582300
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -312,7 +312,13 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
       pathToNonPartitionedTable,
       userSpecifiedSchema = Option("num int, str string"),
       userSpecifiedPartitionCols = partitionCols,
-      expectedSchema = new StructType().add("num", IntegerType).add("str", StringType),
+      expectedSchema = if (partitionCols.isDefined) {
--- End diff --

I think that would be testing something slightly different.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90581943
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90580780
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -312,7 +312,13 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
       pathToNonPartitionedTable,
       userSpecifiedSchema = Option("num int, str string"),
       userSpecifiedPartitionCols = partitionCols,
-      expectedSchema = new StructType().add("num", IntegerType).add("str", StringType),
+      expectedSchema = if (partitionCols.isDefined) {
--- End diff --

shall we just change the test to use `str` as partition column?
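
For concreteness, the suggestion amounts to something like this (a sketch; `createTable` stands in for the suite's actual helper):

```
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical variant: partition by the existing `str` column, which is
// already last in the schema, so expectedSchema needs no branching on
// whether partition columns were specified.
createTable(
  pathToNonPartitionedTable,
  userSpecifiedSchema = Option("num int, str string"),
  userSpecifiedPartitionCols = Some("str"),
  expectedSchema = new StructType().add("num", IntegerType).add("str", StringType))
```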





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90580481
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
+      identifier = table.identifier.copy(
+        database = Some(
+          table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
+      tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)
     val dataSource: BaseRelation =
       DataSource(
         sparkSession = sparkSession,
         userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
+        partitionColumns = table.partitionColumnNames,
         className = table.provider.get,
         bucketSpec = table.bucketSpec,
-        options = table.storage.properties ++ pathOption).resolveRelation()
+        options = table.storage.properties ++ pathOption,
+        catalogTable = Some(uncreatedTable)).resolveRelation()
--- End diff --

Why do we need to pass in the `catalogTable` here?





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90580133
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    // Fill in some default table options from the session conf
+    val uncreatedTable = table.copy(
--- End diff --

How about `tableWithDefaultOptions`?





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90555068
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -565,7 +571,8 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
       val table = catalog.getTableMetadata(TableIdentifier("tbl"))
       assert(table.tableType == CatalogTableType.MANAGED)
       assert(table.provider == Some("parquet"))
-      assert(table.schema == new StructType().add("a", IntegerType).add("b", IntegerType))
+      // a is ordered last since it is a user-specified partitioning column
+      assert(table.schema == new StructType().add("b", IntegerType).add("a", IntegerType))
--- End diff --

@yhuai this is the minor behavior change to CREATE TABLE in 2.1 that we discussed.
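
To make the reordering concrete (a sketch against a 2.1-style session; the table and column names are taken from the test above):

```
// Partition columns are appended after the data columns, so a table declared
// with columns (a, b) but partitioned by `a` ends up with schema (b, a).
spark.sql("CREATE TABLE tbl(a INT, b INT) USING parquet PARTITIONED BY (a)")
spark.table("tbl").schema.fieldNames // Array("b", "a")
```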





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-12-01 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90554917
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,20 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    val uncreatedTable = table.copy(
--- End diff --

Hm, that seems more brittle since you'd have to duplicate the logic. I 
added a comment describing why we need to do this here.





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-11-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16090#discussion_r90365422
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -58,13 +58,20 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
     // Create the relation to validate the arguments before writing the metadata to the metastore,
     // and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
     val pathOption = table.storage.locationUri.map("path" -> _)
+    val uncreatedTable = table.copy(
--- End diff --

Can we do this before we pass the `CatalogTable` in, e.g. in the parser and `DataFrameWriter`?
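
A sketch of that alternative, assuming each call site that builds a `CatalogTable` filled in the defaults itself (illustrative only; not the approach the PR takes):

```
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}

// Hypothetical: the parser / DataFrameWriter would qualify the database and
// set tracksPartitionsInCatalog before handing the table to the command,
// duplicating this logic at every construction site.
val table = CatalogTable(
  identifier = TableIdentifier("t", Some(sessionState.catalog.getCurrentDatabase)),
  tableType = CatalogTableType.MANAGED,
  storage = storageFormat,
  schema = schema,
  provider = Some("parquet"),
  tracksPartitionsInCatalog = sessionState.conf.manageFilesourcePartitions)
```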





[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...

2016-11-30 Thread ericl
GitHub user ericl opened a pull request:

https://github.com/apache/spark/pull/16090

[SPARK-18661] [SQL] Creating a partitioned datasource table should not scan 
all files for table

## What changes were proposed in this pull request?

Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues `MSCK REPAIR TABLE`), it seems we still scan the filesystem for no good reason. We should avoid doing this when the user specifies a schema.
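
A minimal way to exercise the scenario (a sketch; the table name and path are placeholders):

```
// Creating a partitioned datasource table with an explicit schema should not
// require listing the (potentially huge) directory tree under the path.
spark.sql("""
  CREATE TABLE logs (id INT, event STRING, day STRING)
  USING parquet
  OPTIONS (path '/data/logs')
  PARTITIONED BY (day)
""")
// Partition metadata is populated later, on demand:
spark.sql("MSCK REPAIR TABLE logs")
```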

## How was this patch tested?

Perf stat tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ericl/spark spark-18661

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16090.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16090


commit 51d2a4141a3e808616d291ed089b0b1d8172b80a
Author: Eric Liang 
Date:   2016-11-30T23:43:59Z

Wed Nov 30 15:43:59 PST 2016



