[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19124

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137416501

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

```diff
@@ -201,13 +201,14 @@ case class AlterTableAddColumnsCommand(
     // make sure any partition columns are at the end of the fields
     val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
+    val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)
     SchemaUtils.checkColumnNameDuplication(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
+    DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))
```

I think this PR passes the above two test cases, too.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137416371

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

DDLSuite and HiveDDLSuite have them here:

- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L2045-L2070
- https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L1736
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137415411

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

Could you add test cases to ensure that partition columns with special characters work?
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137390245

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

It's okay. Inside `checkDataSchemaFieldNames`, we only use `table.dataSchema`, like the following:

```scala
ParquetSchemaConverter.checkFieldNames(table.dataSchema)
```

For partition columns, we have always allowed special characters.
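To make the data-schema-only behavior above concrete, here is a hypothetical, Spark-free sketch of such a validator. The object and method names are invented for illustration; the invalid character list follows the test cases quoted later in this thread, and the real check lives in `ParquetSchemaConverter.checkFieldNames`.

```scala
// Hypothetical sketch: validate only data columns, skip partition columns,
// mirroring the intent discussed in this PR (not Spark's actual code).
object FieldNameCheckSketch {
  // Characters rejected in Parquet field names, per the test cases in this thread.
  private val invalidChars = " ,;{}()\n\t="

  def checkFieldNames(dataColumns: Seq[String]): Unit =
    dataColumns.foreach { name =>
      if (name.exists(c => invalidChars.contains(c))) {
        throw new IllegalArgumentException(
          s"Attribute name `$name` contains invalid character(s)")
      }
    }

  // Partition columns are deliberately not validated, since they are not
  // part of the Parquet/ORC file schema.
  def checkTable(dataColumns: Seq[String], partitionColumns: Seq[String]): Unit =
    checkFieldNames(dataColumns)
}
```

With this shape, a special character in a partition column passes while the same character in a data column raises an error.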
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137383267

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

`newSchema` also contains the partition schema. How about the partition schema? Do we have the same limits on it?
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137180809

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala

```diff
@@ -848,4 +851,19 @@ object DDLUtils {
       }
     }
   }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
```

I'll rename this to `checkDataSchemaFieldNames`.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137180619

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

```diff
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
```

Sorry, I found that your code is better. I'll update it to match yours.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137178908

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala

```diff
@@ -848,4 +851,19 @@ object DDLUtils {
       }
     }
   }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
+      OrcFileFormat.checkFieldNames(table.dataSchema)
+    } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
```

AFAIK, it's only `org.apache.hadoop.hive.ql.io.orc.OrcSerde`. I checked again whether Apache ORC 1.4.0 has a renamed one under the `hive-storage` API, but it does not. For Parquet, I'll handle that too.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174591

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala, in `checkFieldNames` (same hunk as quoted above)

Yep!
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174463

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (new file)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.orc

import org.apache.orc.TypeDescription

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.StructType

private[sql] object OrcFileFormat {
  private def checkFieldName(name: String): Unit = {
    try {
      TypeDescription.fromString(s"struct<$name:int>")
```

Yep. I agree that it's a little ugly now.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174337

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

Is it okay to use the following?

```scala
val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
val newDataSchema = StructType(catalogTable.dataSchema ++ columns)

SchemaUtils.checkColumnNameDuplication(
  reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
  conf.caseSensitiveAnalysis)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))

catalog.alterTableSchema(
  table, catalogTable.schema.copy(fields = reorderedSchema.toArray))
```
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174215

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

Ur, actually, excluding partition columns was intentional. Maybe I used a misleading PR title and description here. So far, I have checked `dataSchema` only. I think partition columns are okay because they are not part of the Parquet/ORC file schema.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137173190

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala

```diff
@@ -130,10 +130,12 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc)
       CreateDataSourceTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)

     case CreateTable(tableDesc, mode, Some(query))
         if query.resolved && DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc.copy(schema = query.schema))
       CreateDataSourceTableAsSelectCommand(tableDesc, mode, query)

     case InsertIntoTable(l @ LogicalRelation(_: InsertableRelation, _, _, _),
```

Oh, I'll remove it from the Hive serde table case. Checking the existing table during INSERT INTO is effectively a no-op.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137171798

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (same hunk as quoted above)

You did the check for Hive serde tables, but no check is done for data source tables?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137171539

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

```scala
val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)

SchemaUtils.checkColumnNameDuplication(
  reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
  conf.caseSensitiveAnalysis)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newSchema))

catalog.alterTableSchema(table, newSchema)
```
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137171079

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

This should be moved to `verifyAlterTableAddColumn`.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137170969

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala, in `checkFieldNames` (same hunk as quoted above)

We could have different Parquet serdes, for example `parquet.hive.serde.ParquetHiveSerDe` and `org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe`. How about ORC?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137170635

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala, in `checkFieldNames` (same hunk as quoted above)

This way is not right. Let's use your previous way with a foreach loop:

```scala
table.provider.foreach {
  _.toLowerCase(Locale.ROOT) match {
    case "hive" =>
```
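The foreach-over-provider shape suggested above can be sketched as a runnable, Spark-free snippet. Only the names "hive", "parquet", and "orc" come from the diff; the helper, its return type, and the substring matching on serde class names are illustrative assumptions, not Spark's actual code.

```scala
import java.util.Locale

// Hypothetical sketch: decide which field-name check (if any) applies,
// iterating over the optional provider instead of calling .get on it.
object SerdeDispatchSketch {
  def formatToCheck(provider: Option[String], serde: Option[String]): Option[String] =
    provider.map(_.toLowerCase(Locale.ROOT)).flatMap {
      case "hive" =>
        // For Hive tables, several serde class names can map to the same
        // format, so match on a substring rather than a single class name.
        serde.map(_.toLowerCase(Locale.ROOT)).collect {
          case s if s.contains("parquet") => "parquet"
          case s if s.contains("orc") => "orc"
        }
      case "parquet" => Some("parquet")
      case "orc" => Some("orc")
      case _ => None // other data sources: no name restriction
    }
}
```

Because the provider is traversed as an `Option`, a table with no provider simply skips the check instead of throwing.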
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137170172

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala

```diff
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
+      OrcFileFormat.checkFieldNames(table.dataSchema)
+    } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
+      ParquetSchemaConverter.checkFieldNames(table.dataSchema)
+    } else {
+      table.provider.get.toLowerCase(Locale.ROOT) match {
```

`table.provider` could be `None` in previous versions of Spark, so `.get` is risky.
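A tiny self-contained illustration of the review point: `.get` on a `None` throws at runtime, whereas `foreach` silently does nothing, so the check is safely skipped for tables written by older Spark versions. The object and method names here are invented for the example.

```scala
// Hypothetical sketch contrasting Option.get with Option.foreach.
object OptionAccessSketch {
  // Throws NoSuchElementException when provider is None.
  def risky(provider: Option[String]): String = provider.get

  // Returns whether the body ran; a None provider is simply a no-op.
  def safe(provider: Option[String]): Boolean = {
    var ran = false
    provider.foreach(_ => ran = true)
    ran
  }
}
```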
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137169805

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (same new file as quoted above; license header and imports omitted)

```scala
private[sql] object OrcFileFormat {
  private def checkFieldName(name: String): Unit = {
    try {
      TypeDescription.fromString(s"struct<$name:int>")
```

Oh, right, that is Java...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137169608

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (same new file as quoted above; license header and imports omitted)

```scala
private[sql] object OrcFileFormat {
  private def checkFieldName(name: String): Unit = {
    try {
      TypeDescription.fromString(s"struct<$name:int>")
```

`parseName` doesn't look public, though. I don't like this line either, but I can't think of another alternative for now.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137169152

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (same new file as quoted above; license header and imports omitted)

```scala
private[sql] object OrcFileFormat {
  private def checkFieldName(name: String): Unit = {
    try {
      TypeDescription.fromString(s"struct<$name:int>")
```

This seems equivalent to calling `TypeDescription.parseName(name)`.
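For readers without the ORC library at hand, here is a rough, dependency-free approximation of what the `fromString(s"struct<$name:int>")` trick validates: ORC's type parser accepts only simple identifiers unquoted, so any other character makes the synthetic struct string unparsable. The regex below is an assumption about ORC's lexer made for illustration, not the library's actual implementation, and the object name is invented.

```scala
// Hypothetical stand-in for OrcFileFormat.checkFieldName, assuming ORC
// accepts only [A-Za-z0-9_]+ as an unquoted field name.
object OrcNameSketch {
  private val simpleName = "^[A-Za-z0-9_]+$".r

  def checkFieldName(name: String): Unit =
    if (simpleName.findFirstIn(name).isEmpty) {
      throw new IllegalArgumentException(
        s"Column name `$name` contains invalid character(s)")
    }
}
```

The real code gets the same effect without committing to a character set, by letting ORC's own parser be the judge.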
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137153437

Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

```diff
@@ -2000,4 +2000,38 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 ORC/Parquet table should not create invalid column names") {
+    Seq(" ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+      withTable("t21912") {
+        Seq("ORC", "PARQUET").foreach { source =>
+          val m = intercept[AnalysisException] {
+            sql(s"CREATE TABLE t21912(`col$name` INT) USING $source")
+          }.getMessage
+          assert(m.contains(s"contains invalid character(s)"))
+
+          val m2 = intercept[AnalysisException] {
+            sql(s"CREATE TABLE t21912 USING $source AS SELECT 1 `col$name`")
+          }.getMessage
+          assert(m2.contains(s"contains invalid character(s)"))
+
+          withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
+            val m3 = intercept[AnalysisException] {
+              sql(s"CREATE TABLE t21912(`col$name` INT) USING hive OPTIONS (fileFormat '$source')")
+            }.getMessage
+            assert(m3.contains(s"contains invalid character(s)"))
+          }
+        }
+
+        // TODO: After SPARK-21929, we need to check ORC, too.
+        Seq("PARQUET").foreach { source =>
```

I added only the Parquet test case due to SPARK-21929.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137153372

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (same hunk as quoted above)

For this command, it's not easy to get the `CatalogTable` at `DataSourceStrategy`.