[GitHub] spark issue #18903: [SPARK-21590][SS]Window start time should support negati...
Github user KevinZwx commented on the issue: https://github.com/apache/spark/pull/18903 test this please
[GitHub] spark issue #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19136 **[Test build #81441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81441/testReport)** for PR 19136 at commit [`a824d44`](https://github.com/apache/spark/commit/a824d44f9a4aac0518c5cd30893c34b36a094798).
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r137175343

--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,24 @@ def _fit(self, dataset):
         randCol = self.uid + "_rand"
         df = dataset.select("*", rand(seed).alias(randCol))
         metrics = [0.0] * numModels
+
+        pool = ThreadPool(processes=min(self.getParallelism(), numModels))
+
         for i in range(nFolds):
             validateLB = i * h
             validateUB = (i + 1) * h
             condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
-            validation = df.filter(condition)
+            validation = df.filter(condition).cache()
             train = df.filter(~condition)
-            models = est.fit(train, epm)
-            for j in range(numModels):
-                model = models[j]
+
+            def singleTrain(index):
+                model = est.fit(train, epm[index])
                 # TODO: duplicate evaluator to take extra params from input
-                metric = eva.evaluate(model.transform(validation, epm[j]))
-                metrics[j] += metric/nFolds
+                metric = eva.evaluate(model.transform(validation, epm[index]))
+                metrics[index] += metric/nFolds
+
+            pool.map(singleTrain, range(numModels))
--- End diff --

Oh, I think this works well. We already have PRs doing similar things: #19110 and #16774.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 Could @gatorsmile and @HyukjinKwon share some guidance on how the exception message should be revised? The current message explains why the query is disallowed when users select only `_corrupt_record`, and suggests an alternative way to obtain the corrupt records.
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19050 Merged build finished. Test PASSed.
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19050 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81439/ Test PASSed.
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19050 **[Test build #81439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81439/testReport)** for PR 19050 at commit [`c1325fb`](https://github.com/apache/spark/commit/c1325fb9b1f8501b1a31b61e9b39bf1213b021f7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174591

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -848,4 +851,19 @@ object DDLUtils {
       }
     }
   }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
--- End diff --

Yep!
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174463

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.TypeDescription
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.types.StructType
+
+private[sql] object OrcFileFormat {
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
--- End diff --

Yep. I agree that it's a little ugly now.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174337

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
 
+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
--- End diff --

Is it okay to use the following?
```scala
val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
val newDataSchema = StructType(catalogTable.dataSchema ++ columns)

SchemaUtils.checkColumnNameDuplication(
  reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
  conf.caseSensitiveAnalysis)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))

catalog.alterTableSchema(
  table, catalogTable.schema.copy(fields = reorderedSchema.toArray))
```
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137174215

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
 
+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
--- End diff --

Ur, actually, excluding partition columns was intentional. Maybe I used a misleading PR title and description here. So far, I have checked `dataSchema` only. I think partition columns are okay because they are not part of the Parquet/ORC file schema.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137173190

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -130,10 +130,12 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc)
       CreateDataSourceTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)
 
     case CreateTable(tableDesc, mode, Some(query))
         if query.resolved && DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc.copy(schema = query.schema))
       CreateDataSourceTableAsSelectCommand(tableDesc, mode, query)
 
     case InsertIntoTable(l @ LogicalRelation(_: InsertableRelation, _, _, _),
--- End diff --

Oh, I'll remove it from the Hive serde table case. Checking the existing table during INSERT INTO seems to actually be a no-op.
[GitHub] spark issue #19124: [SPARK-21912][SQL] ORC/Parquet table should not create i...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19124 Oh, thank you for the review, @viirya, @HyukjinKwon and @gatorsmile! I'll follow up on your comments!
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137171798

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -130,10 +130,12 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc)
       CreateDataSourceTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)
 
     case CreateTable(tableDesc, mode, Some(query))
         if query.resolved && DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc.copy(schema = query.schema))
       CreateDataSourceTableAsSelectCommand(tableDesc, mode, query)
 
     case InsertIntoTable(l @ LogicalRelation(_: InsertableRelation, _, _, _),
--- End diff --

You did the check for Hive serde tables, but no check is done for data source tables?
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137171539

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
 
+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
--- End diff --

```scala
val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)

SchemaUtils.checkColumnNameDuplication(
  reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
  conf.caseSensitiveAnalysis)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newSchema))

catalog.alterTableSchema(table, newSchema)
```
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137171079

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
 
+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
--- End diff --

This should be moved to `verifyAlterTableAddColumn`.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137170969

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -848,4 +851,19 @@ object DDLUtils {
       }
     }
   }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
+      OrcFileFormat.checkFieldNames(table.dataSchema)
+    } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
--- End diff --

We could have different Parquet serdes, for example `parquet.hive.serde.ParquetHiveSerDe` and `org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe`. How about ORC?
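To illustrate the concern, a minimal sketch that matches the serde against a set of known class names instead of a single string; the helper name and the exact set here are assumptions for illustration, not code from this PR:

```scala
// Hypothetical helper: treat any known Parquet serde class name as "Parquet",
// so tables registered under either class name are validated.
private def isParquetSerde(serde: Option[String]): Boolean = {
  val knownParquetSerdes = Set(
    "parquet.hive.serde.ParquetHiveSerDe",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
  serde.exists(knownParquetSerdes.contains)
}
```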
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137170635

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -848,4 +851,19 @@ object DDLUtils {
       }
     }
   }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
--- End diff --

This way is not right. Let's use your previous way with a foreach loop:
```
table.provider.foreach {
  _.toLowerCase(Locale.ROOT) match {
    case "hive" =>
```
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137170172

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -848,4 +851,19 @@ object DDLUtils {
       }
     }
   }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
+      OrcFileFormat.checkFieldNames(table.dataSchema)
+    } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
+      ParquetSchemaConverter.checkFieldNames(table.dataSchema)
+    } else {
+      table.provider.get.toLowerCase(Locale.ROOT) match {
--- End diff --

`table.provider` could be `None` in the previous versions of Spark. Thus, `.get` is risky.
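As a rough sketch of a `None`-safe variant, assuming `provider` is an `Option[String]` as the comment implies; the match arms are illustrative, not this PR's final code:

```scala
import java.util.Locale

table.provider.foreach { provider =>
  provider.toLowerCase(Locale.ROOT) match {
    // Delegate to the format-specific validators quoted in the diff above.
    case "orc" => OrcFileFormat.checkFieldNames(table.dataSchema)
    case "parquet" => ParquetSchemaConverter.checkFieldNames(table.dataSchema)
    case _ => // other providers impose no field-name restrictions
  }
}
```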
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137169805

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.TypeDescription
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.types.StructType
+
+private[sql] object OrcFileFormat {
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
--- End diff --

Oh, right, that is Java...
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137169608

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.TypeDescription
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.types.StructType
+
+private[sql] object OrcFileFormat {
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
--- End diff --

`parseName` doesn't look public, though... I don't like this line either, but I could not think of another alternative for now.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137169152

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.TypeDescription
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.types.StructType
+
+private[sql] object OrcFileFormat {
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
--- End diff --

This seems equivalent to calling `TypeDescription.parseName(name)`.
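For context, a sketch completing the validation pattern quoted above; it assumes ORC's `TypeDescription` parser signals an invalid name with `IllegalArgumentException`, and the error message text is illustrative:

```scala
private def checkFieldName(name: String): Unit = {
  try {
    // Embed the name in a dummy struct type: parsing fails iff the name is invalid.
    TypeDescription.fromString(s"struct<$name:int>")
  } catch {
    case _: IllegalArgumentException =>
      throw new AnalysisException(
        s"""Column name "$name" contains invalid character(s). Please use alias to rename it.""")
  }
}
```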
[GitHub] spark pull request #18935: [SPARK-9104][CORE] Expose Netty memory metrics in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18935
[GitHub] spark issue #18935: [SPARK-9104][CORE] Expose Netty memory metrics in Spark
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/18935 Merging to master.
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17254 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81437/ Test FAILed.
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17254 Build finished. Test FAILed.
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17254 **[Test build #81437 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81437/testReport)** for PR 17254 at commit [`36a3463`](https://github.com/apache/spark/commit/36a34632dbb000799c35727c00d1542d4bb1ce00).
* This patch **fails PySpark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/18692 @cloud-fan: In the event that the set of join keys is a superset of the child node's partitioning keys, it's possible to avoid the shuffle: https://github.com/apache/spark/pull/19054 ... this can help with two cases (sketched below):
- when users unknowingly join over extra columns in addition to the bucket columns
- the one you mentioned (i.e., inferred conditions)
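A hypothetical illustration of the first case: both tables are bucketed by `a`, and the join keys `(a, b)` are a strict superset of the bucket columns. All table and column names here are made up, and whether the exchange is actually removed depends on the optimization in #19054:

```scala
// Both sides are bucketed into 8 buckets by column `a`.
spark.range(100).selectExpr("id AS a", "id AS b")
  .write.bucketBy(8, "a").saveAsTable("t1")
spark.range(100).selectExpr("id AS a", "id AS b")
  .write.bucketBy(8, "a").saveAsTable("t2")

// Join keys (a, b) are a superset of the bucket columns (a): today this
// plans an Exchange on (a, b) on both sides, which could in principle be avoided.
spark.table("t1").join(spark.table("t2"), Seq("a", "b")).explain()
```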
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19050 Thanks @gatorsmile
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19050 **[Test build #81440 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81440/testReport)** for PR 19050 at commit [`8550828`](https://github.com/apache/spark/commit/85508287ca1b98f3a3c341efd3ac70f99b56bc73).
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19050 LGTM pending Jenkins
[GitHub] spark issue #19124: [SPARK-21912][SQL] ORC/Parquet table should not create i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81435/ Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] ORC/Parquet table should not create i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Merged build finished. Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] ORC/Parquet table should not create i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81435 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81435/testReport)** for PR 19124 at commit [`c6e9ab6`](https://github.com/apache/spark/commit/c6e9ab6291dda034fe39263202ea5bc2373cd86c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137167260

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
@@ -875,4 +876,70 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
       assert(e.message.contains("cannot resolve '`a`' given input columns: [t.i, t.j]"))
     }
   }
+
+  test("SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1") {
+    withTable("t1") {
+      withTempPath { path =>
+        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
+        sql(s"CREATE TABLE t1 USING parquet LOCATION '${path.toURI}'")
+
+        val sqlText =
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin
+        val optimizedPlan = sql(sqlText).queryExecution.optimizedPlan
+        val join = optimizedPlan.collect {
+          case j: Join => j
+        }.head.asInstanceOf[Join]
+        assert(join.duplicateResolved)
+        assert(optimizedPlan.resolved)
+      }
+    }
+  }
+
+  test("SPARK-21835: Join in correlated subquery should be duplicateResolved: case 2") {
+    withTable("t1", "t2", "t3") {
+      withTempPath { path =>
+        val data = Seq((1, 1, 1), (2, 0, 2))
+
+        data.toDF("t1a", "t1b", "t1c").write.parquet(path.getCanonicalPath + "/t1")
+        data.toDF("t2a", "t2b", "t2c").write.parquet(path.getCanonicalPath + "/t2")
+        data.toDF("t3a", "t3b", "t3c").write.parquet(path.getCanonicalPath + "/t3")
+
+        sql(s"CREATE TABLE t1 USING parquet LOCATION '${path.toURI}/t1'")
+        sql(s"CREATE TABLE t2 USING parquet LOCATION '${path.toURI}/t2'")
+        sql(s"CREATE TABLE t3 USING parquet LOCATION '${path.toURI}/t3'")
+
+        val sqlText =
+          s"""
+             |SELECT *
+             |FROM (SELECT *
+             |      FROM t2
+             |      WHERE t2c IN (SELECT t1c
+             |                    FROM t1
+             |                    WHERE t1a = t2a)
+             |      UNION
+             |      SELECT *
+             |      FROM t3
+             |      WHERE t3a IN (SELECT t2a
+             |                    FROM t2
+             |                    UNION ALL
+             |                    SELECT t1a
+             |                    FROM t1
+             |                    WHERE t1b > 0)) t4
+             |WHERE t4.t2b IN (SELECT Min(t3b)
+             |                 FROM t3
+             |                 WHERE t4.t2a = t3a)
+           """.stripMargin
+        val optimizedPlan = sql(sqlText).queryExecution.optimizedPlan
+        val joinNodes = optimizedPlan.collect {
+          case j: Join => j
+        }.map(_.asInstanceOf[Join])
+        joinNodes.map(j => assert(j.duplicateResolved))
--- End diff --

```scala
val joinNodes = optimizedPlan.collect { case j: Join => j }
joinNodes.foreach(j => assert(j.duplicateResolved))
```
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137167216

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
@@ -875,4 +876,70 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
       assert(e.message.contains("cannot resolve '`a`' given input columns: [t.i, t.j]"))
     }
   }
+
+  test("SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1") {
+    withTable("t1") {
+      withTempPath { path =>
+        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
+        sql(s"CREATE TABLE t1 USING parquet LOCATION '${path.toURI}'")
+
+        val sqlText =
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin
+        val optimizedPlan = sql(sqlText).queryExecution.optimizedPlan
+        val join = optimizedPlan.collect {
+          case j: Join => j
+        }.head.asInstanceOf[Join]
--- End diff --

```scala
val join = optimizedPlan.collectFirst { case j: Join => j }.get
```
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19132 @ajbozarth I have updated the implementation so that it only accesses the FS in FSHistoryServerProvider.
[GitHub] spark issue #19131: [MINOR][SQL]remove unuse import class
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19131 Jenkins, test this please.
[GitHub] spark pull request #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18966#discussion_r137164501

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -769,16 +769,27 @@ class CodegenContext {
       foldFunctions: Seq[String] => String = _.mkString("", ";\n", ";")): String = {
     val blocks = new ArrayBuffer[String]()
     val blockBuilder = new StringBuilder()
+    val defaultMaxLines = 100
+    val maxLines = if (SparkEnv.get != null) {
+      SparkEnv.get.conf.getInt("spark.sql.codegen.expressions.maxCodegenLinesPerFunction",
--- End diff --

I see.
[GitHub] spark issue #18704: [SPARK-20783][SQL] Create ColumnVector to abstract exist...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18704 ping @cloud-fan
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19140 @redsanket, can you please test this with a secure Hadoop environment using spark-submit (not Oozie)? I don't want to introduce any regression here.
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r137162378

--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,24 @@ def _fit(self, dataset):
         randCol = self.uid + "_rand"
         df = dataset.select("*", rand(seed).alias(randCol))
         metrics = [0.0] * numModels
+
+        pool = ThreadPool(processes=min(self.getParallelism(), numModels))
+
         for i in range(nFolds):
             validateLB = i * h
             validateUB = (i + 1) * h
             condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
-            validation = df.filter(condition)
+            validation = df.filter(condition).cache()
             train = df.filter(~condition)
-            models = est.fit(train, epm)
-            for j in range(numModels):
-                model = models[j]
+
+            def singleTrain(index):
+                model = est.fit(train, epm[index])
                 # TODO: duplicate evaluator to take extra params from input
-                metric = eva.evaluate(model.transform(validation, epm[j]))
-                metrics[j] += metric/nFolds
+                metric = eva.evaluate(model.transform(validation, epm[index]))
+                metrics[index] += metric/nFolds
+
+            pool.map(singleTrain, range(numModels))
--- End diff --

The actual fitting and evaluation methods run here might include CPU-bound code, so I am not sure that multithreading here will boost performance much.
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r137162089

--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,23 @@ def _fit(self, dataset):
         randCol = self.uid + "_rand"
         df = dataset.select("*", rand(seed).alias(randCol))
         metrics = [0.0] * numModels
+
+        pool = ThreadPool(processes=min(self.getParallelism(), numModels))
+
         for i in range(nFolds):
             validateLB = i * h
             validateUB = (i + 1) * h
             condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
-            validation = df.filter(condition)
+            validation = df.filter(condition).cache()
--- End diff --

That's right, but it seems we don't check whether the input dataset is cached here. Should we cache it if it is not?
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19050 **[Test build #81439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81439/testReport)** for PR 19050 at commit [`c1325fb`](https://github.com/apache/spark/commit/c1325fb9b1f8501b1a31b61e9b39bf1213b021f7).
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19110 Merged build finished. Test PASSed.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19110 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81438/ Test PASSed.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19110 **[Test build #81438 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81438/testReport)** for PR 19110 at commit [`edcf85c`](https://github.com/apache/spark/commit/edcf85c08f25044520d43b919e0475e0f047001b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137159908

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
@@ -49,6 +49,30 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
     }
   }
 
+  def dedupJoin(plan: LogicalPlan): LogicalPlan = {
--- End diff --

ok.
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137159871

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
@@ -875,4 +876,71 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
       assert(e.message.contains("cannot resolve '`a`' given input columns: [t.i, t.j]"))
     }
   }
+
+  test("SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1") {
+    withTable("t1") {
+      withTempPath { path =>
+        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
+        sql(s"CREATE TABLE t1 USING parquet LOCATION '${path.toURI}'")
+
+        val sqlText =
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin
+        val ds = sql(sqlText)
--- End diff --

Yes, I missed this. I'll remove it.
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137159884

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
@@ -98,6 +122,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
         val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
         Project(p.output, Filter(newCond.get, inputPlan))
     }
+    dedupJoin(rewritten)
--- End diff --

Fair point. I'll follow it.
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137159896

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
@@ -49,6 +49,30 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
     }
   }
 
+  def dedupJoin(plan: LogicalPlan): LogicalPlan = {
+    plan transform {
+      case j @ Join(left, right, joinType, joinCond) =>
--- End diff --

Sure.
[GitHub] spark issue #19056: [SPARK-21765] Check that optimization doesn't affect isS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19056 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81434/ Test PASSed.
[GitHub] spark issue #19056: [SPARK-21765] Check that optimization doesn't affect isS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19056 Merged build finished. Test PASSed.
[GitHub] spark issue #19056: [SPARK-21765] Check that optimization doesn't affect isS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19056 **[Test build #81434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81434/testReport)** for PR 19056 at commit [`a3ec0f2`](https://github.com/apache/spark/commit/a3ec0f2cf3ec92aa30327c856820722ae7f22e7c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18692

> After adding the inferred join conditions, it might lead to the child node's partitioning NOT satisfying the JOIN node's requirements which otherwise could have.

Isn't it an existing problem? The current constraint propagation framework infers as many predicates as possible, so we may already hit this problem. I think we should revisit the constraint propagation framework and think about how to avoid adding more shuffles, rather than stop improving the framework to infer more predicates.
[GitHub] spark issue #19129: [SPARK-13656][SQL] Delete spark.sql.parquet.cacheMetadat...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19129 So it was removed before 2.0.0.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19110 **[Test build #81438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81438/testReport)** for PR 19110 at commit [`edcf85c`](https://github.com/apache/spark/commit/edcf85c08f25044520d43b919e0475e0f047001b).
[GitHub] spark pull request #18628: [SPARK-18061][ThriftServer] Add spnego auth suppo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18628
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19132 Thanks for your recommendation, @ajbozarth. Could you post a link to your PR? As for the problems you mentioned, I have thought about them:
1. FsHistoryServer will always use the FS to get the event log.
2. For the Spark UI, my implementation will not access the FS.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19110 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81436/ Test FAILed.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19110 **[Test build #81436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81436/testReport)** for PR 19110 at commit [`7d0849e`](https://github.com/apache/spark/commit/7d0849eae7601eb3e24240cb8462985e95932f85).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18628: [SPARK-18061][ThriftServer] Add spnego auth support for ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18628 Thanks @jiangxb1987, let me merge it to master.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19110 Merged build finished. Test FAILed.
[GitHub] spark issue #17254: [SPARK-19917][SQL]qualified partition path stored in cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17254 **[Test build #81437 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81437/testReport)** for PR 17254 at commit [`36a3463`](https://github.com/apache/spark/commit/36a34632dbb000799c35727c00d1542d4bb1ce00).
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19110 **[Test build #81436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81436/testReport)** for PR 19110 at commit [`7d0849e`](https://github.com/apache/spark/commit/7d0849eae7601eb3e24240cb8462985e95932f85).
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137153437

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,38 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 ORC/Parquet table should not create invalid column names") {
+    Seq(" ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+      withTable("t21912") {
+        Seq("ORC", "PARQUET").foreach { source =>
+          val m = intercept[AnalysisException] {
+            sql(s"CREATE TABLE t21912(`col$name` INT) USING $source")
+          }.getMessage
+          assert(m.contains(s"contains invalid character(s)"))
+
+          val m2 = intercept[AnalysisException] {
+            sql(s"CREATE TABLE t21912 USING $source AS SELECT 1 `col$name`")
+          }.getMessage
+          assert(m2.contains(s"contains invalid character(s)"))
+
+          withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
+            val m3 = intercept[AnalysisException] {
+              sql(s"CREATE TABLE t21912(`col$name` INT) USING hive OPTIONS (fileFormat '$source')")
+            }.getMessage
+            assert(m3.contains(s"contains invalid character(s)"))
+          }
+        }
+
+        // TODO: After SPARK-21929, we need to check ORC, too.
+        Seq("PARQUET").foreach { source =>
--- End diff --

I added only the Parquet test case, due to SPARK-21929.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137153372

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
 
+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
--- End diff --

For this command, it's not easy to get the `CatalogTable` at `DataSourceStrategy`.
[GitHub] spark issue #19124: [SPARK-21912][SQL] ORC/Parquet table should not create i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81435/testReport)** for PR 19124 at commit [`c6e9ab6`](https://github.com/apache/spark/commit/c6e9ab6291dda034fe39263202ea5bc2373cd86c).
[GitHub] spark pull request #19102: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...
Github user lgrcyanny closed the pull request at: https://github.com/apache/spark/pull/19102
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19117 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81433/ Test FAILed.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 @gatorsmile Any more comments on this? Regarding the behavior change, should we follow Spark's previous behavior or follow Hive's? I'm OK with both.
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19117 Merged build finished. Test FAILed.
[GitHub] spark pull request #19140: [SPARK-21890] Credentials not being passed to add...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19140#discussion_r137152903

--- Diff: core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala ---
@@ -103,15 +103,17 @@ private[deploy] class HadoopFSDelegationTokenProvider(fileSystems: Configuration
   private def getTokenRenewalInterval(
       hadoopConf: Configuration,
-      filesystems: Set[FileSystem]): Option[Long] = {
+      filesystems: Set[FileSystem],
+      creds: Credentials): Option[Long] = {
     // We cannot use the tokens generated with renewer yarn. Trying to renew
     // those will fail with an access control issue. So create new tokens with the logged in
     // user as renewer.
-    val creds = fetchDelegationTokens(
+    val fetchCreds = fetchDelegationTokens(
--- End diff --

That code was in `getTokenRenewalInterval`; that call is only needed when a principal and keytab are provided, so adding the code back should be OK. It shouldn't cause any issues if it's not there, though, aside from a wasted round trip to the NNs.
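A rough sketch of the gating described above, under stated assumptions: the configuration keys and the surrounding variable names are guesses for illustration, not the provider's actual code.

```scala
// Only compute the renewal interval when a principal and keytab are configured,
// since that is the only case where Spark itself re-logs in and renews tokens;
// otherwise skip the extra round trip to the NameNodes.
val renewalInterval: Option[Long] =
  if (sparkConf.contains("spark.yarn.principal") && sparkConf.contains("spark.yarn.keytab")) {
    getTokenRenewalInterval(hadoopConf, fileSystems, creds)
  } else {
    None
  }
```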
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19117 **[Test build #81433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81433/testReport)** for PR 19117 at commit [`595e502`](https://github.com/apache/spark/commit/595e502e8bd6ac6570d0975188bd6039498ece2a).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19142: When the number of attempting to restart receiver greate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19142 Can one of the admins verify this patch?
[GitHub] spark pull request #19142: When the number of attempting to restart receiver...
GitHub user liuxianjiao opened a pull request: https://github.com/apache/spark/pull/19142

When the number of attempts to restart the receiver is greater than 0, Spark does nothing in the 'else' branch, so I think we should log a trace message to let users know why.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liuxianjiao/spark master0905

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19142.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19142

commit c4edc1b4304f5b540b576ea60e260f5caef303c2
Author: liuxianjiao
Date: 2017-09-06T01:03:47Z

    [SPARK-21930] When the number of attempts to restart the receiver is greater than 0, Spark does nothing in 'else'
[GitHub] spark issue #19135: [SPARK-21923][CORE]Avoid call reserveUnrollMemoryForThis...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19135 Hi @cloud-fan, the previous version was written the same way as `putIteratorAsValues`. I have now modified the code so that each allocation requests an additional `chunkSize` bytes of memory, because `ChunkedByteBufferOutputStream` grows by exactly `chunkSize` each time.
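A sketch of the reservation strategy being described, with heavy caveats: the loop shape and the `reserveUnrollMemoryForThisTask` signature are assumptions based on the names in this thread, not the PR's actual diff.

```scala
// Reserve unroll memory one chunk at a time, matching the growth unit of
// ChunkedByteBufferOutputStream, instead of re-reserving for every record.
while (values.hasNext && keepUnrolling) {
  serializationStream.writeObject(values.next())(classTag)
  if (bbos.size > unrollMemoryUsed) {
    // The stream just grew by one chunk, so reserve exactly one more chunk.
    keepUnrolling = reserveUnrollMemoryForThisTask(blockId, chunkSize, memoryMode)
    if (keepUnrolling) {
      unrollMemoryUsed += chunkSize
    }
  }
}
```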
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865 @jmchung, just to be clear, sure, let's go this way. I guess we have only one comment left to address now:

> Please update the error message and also add it to the migration guide.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Merged build finished. Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19124 I created SPARK-21929 for **"Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source"**. For Parquet ALTER TABLE, yes, I think I can include that here. But I'm not sure about the PR title; it wouldn't be accurate because the coverage is only partial. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81432/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81432/testReport)** for PR 19124 at commit [`8ee87dd`](https://github.com/apache/spark/commit/8ee87dd0d799d0e4504ca11c1f1d31f1141a0844). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19056: [SPARK-21765] Check that optimization doesn't affect isS...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/19056 LGTM. Will merge after tests pass. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19140 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19124 Could this PR cover this scenario? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19140 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81431/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865 @gatorsmile, thanks for elaborating on this. It looks like a fair point. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19140 **[Test build #81431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81431/testReport)** for PR 19140 at commit [`d72c08f`](https://github.com/apache/spark/commit/d72c08f72d02b2288e09566f191bfe310d6cfbc7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19124 For that, no. That scenario hasn't been considered yet, just like the other code path. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19124 Will altering a table to add columns with illegal column names issue an error message? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19124 Parquet works. I tested. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19124 How about Parquet? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19050: [SPARK-21835][SQL] RewritePredicateSubquery shoul...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19050#discussion_r137148036 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
@@ -98,6 +122,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
         val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
         Project(p.output, Filter(newCond.get, inputPlan))
     }
+    dedupJoin(rewritten)
--- End diff --
After rethinking it, we can be more conservative. Instead of doing a dedup at the end, we should do it when we convert it to the `Join`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
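A sketch of that more conservative approach, with a hypothetical helper name: deduplicate only at the point where the subquery is converted into a `Join`, by re-aliasing any attributes the inner plan shares with the outer plan.

    private def dedupSubqueryOnSelfJoin(outer: LogicalPlan, sub: LogicalPlan): LogicalPlan = {
      // On a self-join the subquery's output can overlap the outer plan's
      // output; alias the conflicting attributes so they get fresh expression
      // IDs and the join condition stays unambiguous.
      val conflicts = outer.outputSet.intersect(sub.outputSet)
      if (conflicts.isEmpty) {
        sub
      } else {
        val newOutput = sub.output.map { a =>
          if (conflicts.contains(a)) Alias(a, a.name)() else a
        }
        Project(newOutput, sub)
      }
    }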
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasource table...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19124 Ah, there are too many missing pieces in the ORC code path. `AlterTableAddColumnsCommand` seems not to allow ORC in [verifyAlterTableAddColumn](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L237-L241). It seems to be blocked for a different reason, but it looks like we need to solve that first in order to add test cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
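The gate being referenced looks roughly like the following (structure paraphrased from the linked `verifyAlterTableAddColumn`, not copied verbatim): only a short whitelist of file formats is accepted, so ORC tables are rejected before any field-name check could run.

    // Paraphrased: ALTER TABLE ... ADD COLUMNS is only allowed for a few formats.
    DataSource.lookupDataSource(catalogTable.provider.get).newInstance() match {
      case _: CSVFileFormat | _: JsonFileFormat | _: ParquetFileFormat => // allowed
      case s =>
        throw new AnalysisException(
          s"ALTER ADD COLUMNS does not support datasource table with type $s.")
    }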
[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19020 Looks good. cc @jkbradley Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19141: [SPARK-21384] [YARN] Spark 2.2 + YARN without spark.yarn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19141 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19126: [SPARK-21915][ML][PySpark]Model 1 and Model 2 ParamMaps ...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19126 Yeah, I checked and this is not a problem in master since #17849 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19141: [SPARK-21384] [YARN] Spark 2.2 + YARN without spa...
GitHub user devaraj-kavali opened a pull request: https://github.com/apache/spark/pull/19141 [SPARK-21384] [YARN] Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails ## What changes were proposed in this pull request? When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the NodeManager's localization. This change removes the immediate deletion of the libraries zip file and lets it be deleted as part of the ShutdownHookManager deletion of paths. ## How was this patch tested? I have verified it manually in yarn/cluster and yarn/client modes with HDFS and local file systems. You can merge this pull request into a Git repository by running: $ git pull https://github.com/devaraj-kavali/spark SPARK-21384 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19141.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19141 commit 208bb685cc899b705aadb7c5aba51334f2d340f0 Author: Devaraj K Date: 2017-09-06T00:22:54Z [SPARK-21384] [YARN] Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
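A minimal sketch of the described fix, with an illustrative variable name: swap the eager delete for shutdown-hook cleanup so the NodeManager can still localize the archive.

    import org.apache.spark.util.ShutdownHookManager

    // Before (problematic): the archive was deleted right after the
    // copy-to-staging decision, racing with the NodeManager's localization.
    //   sparkArchive.delete()

    // After: register the archive for deletion at JVM shutdown instead.
    ShutdownHookManager.registerShutdownDeleteDir(sparkArchive)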
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasourc...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137146867 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -130,10 +130,12 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc)
       CreateDataSourceTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)

     case CreateTable(tableDesc, mode, Some(query))
         if query.resolved && DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc.copy(schema = query.schema))
       CreateDataSourceTableAsSelectCommand(tableDesc, mode, query)

     case InsertIntoTable(l @ LogicalRelation(_: InsertableRelation, _, _, _),
--- End diff --
So far, it looks different from CTAS. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC/Parquet datasourc...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137146817 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -130,10 +130,12 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc)
       CreateDataSourceTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)

     case CreateTable(tableDesc, mode, Some(query))
         if query.resolved && DDLUtils.isDatasourceTable(tableDesc) =>
+      DDLUtils.checkFieldNames(tableDesc.copy(schema = query.schema))
       CreateDataSourceTableAsSelectCommand(tableDesc, mode, query)

     case InsertIntoTable(l @ LogicalRelation(_: InsertableRelation, _, _, _),
--- End diff --
Sorry, but I'm not sure when `INSERT INTO TABLE` has this kind of issue. In case of `INSERT INTO`, the table already exists. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
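For reference, a sketch of the kind of validation `DDLUtils.checkFieldNames` performs in these branches; the rejected character set follows Parquet's field-name rules, and the body is illustrative rather than the exact implementation:

    def checkFieldNames(table: CatalogTable): Unit = {
      // Parquet cannot store columns whose names contain these characters,
      // so fail analysis early with a clear message.
      val invalidChars = " ,;{}()\n\t="
      table.schema.fieldNames.foreach { name =>
        if (name.exists(invalidChars.contains(_))) {
          throw new AnalysisException(
            s"""Column name "$name" contains invalid character(s); please use an alias to rename it.""")
        }
      }
    }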