[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16090 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90723948 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, --- End diff -- That's a pretty big change, considering how many classes depend on the eager behavior of `InMemoryFileIndex`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90637793 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, --- End diff -- Then can we just make the `InMemoryFileIndex` scan the file lazily? If we only need to infer the schema and partition columns, it should not do the scan. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90591997 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, --- End diff -- You do need to pass it in though. ``` val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions && catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) { new CatalogFileIndex( sparkSession, catalogTable.get, catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L)) } else { new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema)) } ``` Otherwise, this code will perform a full filesystem scan, independent of the other change to prevent getOrInferFileFormatSchema from performing a scan as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90590350 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, --- End diff -- You don't want to do that though. Resolve relation also does not always scan the filesystem if you pass in a user defined schema. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90583259 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, --- End diff -- I think it's fine to create an `InMemoryFileIndex` for this case, as we call `DataSource.resolveRelation` just to infer the schema and partition columns. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90582325 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, className = table.provider.get, bucketSpec = table.bucketSpec, -options = table.storage.properties ++ pathOption).resolveRelation() +options = table.storage.properties ++ pathOption, +catalogTable = Some(uncreatedTable)).resolveRelation() --- End diff -- Otherwise, we will construct an InMemoryFileIndex which scans the filesystem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90582283 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, --- End diff -- You also need to pass catalogTable in, so that on line 390 of DataSource we create a CatalogFileIndex instead of an InMemoryFileIndex. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90581994 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) --- End diff -- I think this is true for all new tables. If the table is unpartitioned the flag is harmless. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90582300 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -312,7 +312,13 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach { pathToNonPartitionedTable, userSpecifiedSchema = Option("num int, str string"), userSpecifiedPartitionCols = partitionCols, - expectedSchema = new StructType().add("num", IntegerType).add("str", StringType), + expectedSchema = if (partitionCols.isDefined) { --- End diff -- I think that would be testing something slightly different. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90581943 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90580780 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -312,7 +312,13 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach { pathToNonPartitionedTable, userSpecifiedSchema = Option("num int, str string"), userSpecifiedPartitionCols = partitionCols, - expectedSchema = new StructType().add("num", IntegerType).add("str", StringType), + expectedSchema = if (partitionCols.isDefined) { --- End diff -- shall we just change the test to use `str` as partition column? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90580481 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( + identifier = table.identifier.copy( +database = Some( + table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))), + tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions) val dataSource: BaseRelation = DataSource( sparkSession = sparkSession, userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema), +partitionColumns = table.partitionColumnNames, className = table.provider.get, bucketSpec = table.bucketSpec, -options = table.storage.properties ++ pathOption).resolveRelation() +options = table.storage.properties ++ pathOption, +catalogTable = Some(uncreatedTable)).resolveRelation() --- End diff -- why we need to pass in the `catalogTable` here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90580133 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +// Fill in some default table options from the session conf +val uncreatedTable = table.copy( --- End diff -- how about `tableWithDefaultOptions`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90555068 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -565,7 +571,8 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach { val table = catalog.getTableMetadata(TableIdentifier("tbl")) assert(table.tableType == CatalogTableType.MANAGED) assert(table.provider == Some("parquet")) - assert(table.schema == new StructType().add("a", IntegerType).add("b", IntegerType)) + // a is ordered last since it is a user-specified partitioning column + assert(table.schema == new StructType().add("b", IntegerType).add("a", IntegerType)) --- End diff -- @yhuai this is the minor behavior change we discussed about create table in 2.1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90554917 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,20 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +val uncreatedTable = table.copy( --- End diff -- Hm, that seems more brittle since you'd have to duplicate the logic. I added a comment describing why we need to do this here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16090#discussion_r90365422 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala --- @@ -58,13 +58,20 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo // Create the relation to validate the arguments before writing the metadata to the metastore, // and infer the table schema and partition if users didn't specify schema in CREATE TABLE. val pathOption = table.storage.locationUri.map("path" -> _) +val uncreatedTable = table.copy( --- End diff -- can we do this before we pass the `CatalogTable`? e.g. in the parser and `DataFrameWriter` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16090: [SPARK-18661] [SQL] Creating a partitioned dataso...
GitHub user ericl opened a pull request: https://github.com/apache/spark/pull/16090 [SPARK-18661] [SQL] Creating a partitioned datasource table should not scan all files for table ## What changes were proposed in this pull request? Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason. We should avoid doing this when the user specifies a schema. ## How was this patch tested? Perf stat tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ericl/spark spark-18661 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16090.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16090 commit 51d2a4141a3e808616d291ed089b0b1d8172b80a Author: Eric LiangDate: 2016-11-30T23:43:59Z Wed Nov 30 15:43:59 PST 2016 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org