[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9#discussion_r78689714

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: Model[_]): this.type = {
--- End diff --

As a follow-up, we could eliminate the setter `def setInitialModel(value: Model[_])`. To have better documentation, we could leave the param abstract in the `HasInitialModel` trait:

    def hasInitialModel: Param[T]

Then, when we add this to new models, we implement the param there. So, in KMeansParams:

    /**
     * Param for KMeansModel to use for warm start.
     * @group param
     */
    final val hasInitialModel: Param[KMeansModel] =
      new Param[KMeansModel](this, "initialModel", "A KMeansModel to use for warm start")

That way the params are explicit about what type of model is used for the initial model, and the documentation is clearer.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
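The suggestion above (an abstract, typed param in the shared trait, made concrete per estimator) can be sketched roughly as follows. This is a minimal illustration only: `Param`, `Model`, and `KMeansModel` here are simplified stand-ins defined in the snippet, not Spark's classes, and the type parameter on `HasInitialModel` is an assumption about how `Param[T]` would be bound.

```scala
// Simplified stand-ins for illustration; these are NOT Spark's Param/Model classes.
final case class Param[T](name: String, doc: String)

trait Model
final class KMeansModel extends Model

// The shared trait leaves the param abstract, so it says nothing about the model type.
trait HasInitialModel[T <: Model] {
  def hasInitialModel: Param[T]
}

// The concrete params trait pins the type to KMeansModel and carries the documentation.
trait KMeansParams extends HasInitialModel[KMeansModel] {
  final val hasInitialModel: Param[KMeansModel] =
    Param[KMeansModel]("initialModel", "A KMeansModel to use for warm start")
}

// Hypothetical holder object, only so the trait can be instantiated in this sketch.
object KMeansParamsDemo extends KMeansParams
```

With this shape, a setter that accepts any `Model[_]` is no longer needed, because the param's type already restricts what can be assigned.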
[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/14961 Confirmed that the issue was introduced by https://github.com/netty/netty/commit/d58dec8862e02fc2a98f8dcdb166db4b788be50a#diff-8d83d75ebf8a18cc48bf0a0b1183c188 Adding `System.setProperty("io.netty.maxDirectMemory", "0");` disables this feature, and the tests then pass.
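A minimal sketch of applying that workaround from Scala test setup code. The property name and value come from the comment above; the wrapper object is hypothetical, and the note about ordering is an assumption (system properties of this kind are usually read once when the library initializes).

```scala
// Hypothetical helper; only the property name and value come from the thread.
object NettyDirectMemoryWorkaround {
  val Key = "io.netty.maxDirectMemory"

  // Assumption: call this before any Netty class is loaded, since the
  // property is most likely read once during static initialization.
  def apply(): Unit = {
    System.setProperty(Key, "0")
    ()
  }
}
```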
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r78689452

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -330,14 +332,237 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
       val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
       dfNoCols.write.format("json").saveAsTable(table_no_cols)
       sql(s"ANALYZE TABLE $table_no_cols COMPUTE STATISTICS")
-      checkStats(
+      checkTableStats(
         table_no_cols,
         isDataSourceTable = true,
         hasSizeInBytes = true,
         expectedRowCounts = Some(10))
     }
   }
 
+  private def checkColStats(
--- End diff --

I used `checkTableStats` in some cases for column stats, so maybe put all test cases for table/column stats into a separate file?
[GitHub] spark issue #14444: [SPARK-16839] [SQL] redundant aliases after cleanupAlias...
Github user eyalfa commented on the issue: https://github.com/apache/spark/pull/1

@HyukjinKwon, thank you very much for your analysis. If you read the history of this PR, you'll see that at some point @hvanhovell suggested that we completely remove CreateStruct and CreateStructUnsafe and leave just a constructor that creates the named version. I've modified the catalyst tests that relied on CreateStruct, so I guess the R tests should be modified as well.

One thing I don't really understand, which is probably related to my (complete) lack of R knowledge: in the Scala API, collect returns rows. What does it return in R, and where does the 'named_struct(...)' come from? Is it the column name in the schema?

@hvanhovell: how strong is the contract of assigning a name to an unnamed column? Should we alias the constructed tree with the backward-compatible name (when creating the named struct)?
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14834 **[Test build #65355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65355/consoleFull)** for PR 14834 at commit [`f537543`](https://github.com/apache/spark/commit/f53754313e0acf2da6d2f923f716b70c7a49e616).
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14834 @dbtsai Thanks for your review. I addressed all but one comment, on which I left a follow-up.
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14834 **[Test build #65354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65354/consoleFull)** for PR 14834 at commit [`0c2de2c`](https://github.com/apache/spark/commit/0c2de2cf70e07dd30960cccd422a4ca4ca35b594).
[GitHub] spark issue #15092: [SPARK-17142][SQL] Complex query triggers binding error ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15092 **[Test build #65353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65353/consoleFull)** for PR 15092 at commit [`dc3b1b2`](https://github.com/apache/spark/commit/dc3b1b288d7340183250acf2765da61497790c64).
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78688956 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
+ */ +case class AnalyzeColumnCommand( +tableName: String, +columnNames: Seq[String]) extends RunnableCommand { + + override def run(sparkSession: SparkSession): Seq[Row] = { +val sessionState = sparkSession.sessionState +val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName) +val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent)) + +// check correctness for column names +val attributeNames = relation.output.map(_.name.toLowerCase) +val invalidColumns = columnNames.filterNot { col => attributeNames.contains(col.toLowerCase)} +if (invalidColumns.nonEmpty) { + throw new AnalysisException(s"Invalid columns for table $tableName: $invalidColumns.") +} + +relation match { + case catalogRel: CatalogRelation => +updateStats(catalogRel.catalogTable, + AnalyzeTableCommand.calculateTotalSize(sparkSession, catalogRel.catalogTable)) + + case logicalRel: LogicalRelation if logicalRel.catalogTable.isDefined => +updateStats(logicalRel.catalogTable.get, logicalRel.relation.sizeInBytes) + + case otherRelation => +throw new AnalysisException(s"ANALYZE TABLE is not supported for " + + s"${otherRelation.nodeName}.") +} + +def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = { + val lowerCaseNames = columnNames.map(_.toLowerCase) + val attributes = +relation.output.filter(attr => lowerCaseNames.contains(attr.name.toLowerCase)) + + // collect column statistics + val aggColumns = mutable.ArrayBuffer[Column](count(Column("*"))) + attributes.foreach(entry => aggColumns ++= statsAgg(entry.name, entry.dataType)) + val statsRow: InternalRow = Dataset.ofRows(sparkSession, relation).select(aggColumns: _*) +.queryExecution.toRdd.collect().head + + // We also update table-level stats to prevent inconsistency in case of table modification + // between the two ANALYZE commands for collecting table-level stats and column-level stats. 
+ val rowCount = statsRow.getLong(0) + var newStats: Statistics = if (catalogTable.stats.isDefined) { +catalogTable.stats.get.copy(sizeInBytes = newTotalSize, rowCount = Some(rowCount)) + } else { +Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount)) + } + + var pos = 1 + val colStats = mutable.HashMap[String, BasicColStats]() + attributes.foreach { attr => +attr.dataType match { + case n: NumericType => +colStats += attr.name -> BasicColStats( + dataType = attr.dataType, + numNulls = rowCount - statsRow.getLong(pos + NumericStatsAgg.numNotNullsIndex), + max = Option(statsRow.get(pos + NumericStatsAgg.maxIndex,
[GitHub] spark pull request #15092: [SPARK-17142][SQL] Complex query triggers binding...
GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/15092

[SPARK-17142][SQL] Complex query triggers binding error in HashAggregateExec [BACKPORT 2.0]

## What changes were proposed in this pull request?

This PR backports #14917 to branch-2.0. It fixes an expression optimization bug caused by the rule `ReorderAssociativeOperator`.

## How was this patch tested?

Added a new test case in ReorderAssociativeOperatorSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark rao-branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15092.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15092
[GitHub] spark issue #14926: [SPARK-17365][Core] Remove/Kill multiple executors toget...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14926 **[Test build #65352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65352/consoleFull)** for PR 14926 at commit [`202482b`](https://github.com/apache/spark/commit/202482bb2fb38c1a5c164fcbd9a214937fb0b392).
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14834#discussion_r78688637

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -595,55 +831,104 @@ class LogisticRegressionModel private[spark] (
    * Predict label for the given feature vector.
    * The behavior of this can be adjusted using [[thresholds]].
    */
-  override protected def predict(features: Vector): Double = {
+  override protected def predict(features: Vector): Double = if (isMultinomial) {
+    super.predict(features)
--- End diff --

Would you mind elaborating? This call ends up calling `predictRaw(features).argmax`, which equates to `margins(features).argmax`. What specialized version are you referring to? Thanks.
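To make the `margins(features).argmax` point concrete, here is a small standalone sketch of predicting a label as the index of the largest margin. It uses plain arrays rather than Spark's `Vector`, and the object and method names are hypothetical.

```scala
// Standalone illustration; not Spark's implementation.
object ArgmaxPredict {
  // Index of the largest margin; ties resolve to the first maximum.
  def argmax(margins: Array[Double]): Int = {
    require(margins.nonEmpty, "margins must not be empty")
    var best = 0
    var i = 1
    while (i < margins.length) {
      if (margins(i) > margins(best)) best = i
      i += 1
    }
    best
  }

  // Stand-in for the multinomial branch: the predicted label is the argmax class.
  def predict(margins: Array[Double]): Double = argmax(margins).toDouble
}
```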
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78688327 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
+ */ +case class AnalyzeColumnCommand( +tableName: String, +columnNames: Seq[String]) extends RunnableCommand { + + override def run(sparkSession: SparkSession): Seq[Row] = { +val sessionState = sparkSession.sessionState +val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName) +val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent)) + +// check correctness for column names +val attributeNames = relation.output.map(_.name.toLowerCase) +val invalidColumns = columnNames.filterNot { col => attributeNames.contains(col.toLowerCase)} +if (invalidColumns.nonEmpty) { + throw new AnalysisException(s"Invalid columns for table $tableName: $invalidColumns.") +} + +relation match { + case catalogRel: CatalogRelation => +updateStats(catalogRel.catalogTable, + AnalyzeTableCommand.calculateTotalSize(sparkSession, catalogRel.catalogTable)) + + case logicalRel: LogicalRelation if logicalRel.catalogTable.isDefined => +updateStats(logicalRel.catalogTable.get, logicalRel.relation.sizeInBytes) + + case otherRelation => +throw new AnalysisException(s"ANALYZE TABLE is not supported for " + + s"${otherRelation.nodeName}.") +} + +def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = { + val lowerCaseNames = columnNames.map(_.toLowerCase) + val attributes = +relation.output.filter(attr => lowerCaseNames.contains(attr.name.toLowerCase)) + + // collect column statistics + val aggColumns = mutable.ArrayBuffer[Column](count(Column("*"))) + attributes.foreach(entry => aggColumns ++= statsAgg(entry.name, entry.dataType)) + val statsRow: InternalRow = Dataset.ofRows(sparkSession, relation).select(aggColumns: _*) +.queryExecution.toRdd.collect().head + + // We also update table-level stats to prevent inconsistency in case of table modification + // between the two ANALYZE commands for collecting table-level stats and column-level stats. 
+ val rowCount = statsRow.getLong(0) + var newStats: Statistics = if (catalogTable.stats.isDefined) { +catalogTable.stats.get.copy(sizeInBytes = newTotalSize, rowCount = Some(rowCount)) + } else { +Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount)) + } + + var pos = 1 + val colStats = mutable.HashMap[String, BasicColStats]() + attributes.foreach { attr => +attr.dataType match { + case n: NumericType => +colStats += attr.name -> BasicColStats( + dataType = attr.dataType, + numNulls = rowCount - statsRow.getLong(pos + NumericStatsAgg.numNotNullsIndex), + max = Option(statsRow.get(pos + NumericStatsAgg.maxIndex,
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14834#discussion_r78688210

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -508,11 +680,42 @@ object LogisticRegression extends DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
     @Since("1.4.0") override val uid: String,
-    @Since("2.0.0") val coefficients: Vector,
-    @Since("1.3.0") val intercept: Double)
+    @Since("2.1.0") val coefficientMatrix: Matrix,
+    @Since("2.1.0") val interceptVector: Vector,
+    @Since("1.3.0") override val numClasses: Int,
+    private val isMultinomial: Boolean)
   extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
   with LogisticRegressionParams with MLWritable {
 
+  @Since("2.0.0")
+  def coefficients: Vector = if (isMultinomial) {
+    throw new SparkException("Multinomial models contain a matrix of coefficients, use " +
+      "coefficientMatrix instead.")
+  } else {
+    _coefficients
+  }
+
+  // convert to appropriate vector representation without replicating data
+  private lazy val _coefficients: Vector = coefficientMatrix match {
+    case dm: DenseMatrix => Vectors.dense(dm.values)
--- End diff --

In that case, `coefficientMatrix` is a 1 x numFeatures dense matrix, so I don't believe it makes any difference whether it's row-major or column-major.
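The layout claim can be checked with a small standalone sketch: for a matrix with a single row, flattening in row-major and column-major order yields the same value array, while for a matrix with more than one row and column the two orders differ. The helper names below are hypothetical and plain arrays stand in for `DenseMatrix`.

```scala
// Standalone illustration using nested arrays instead of Spark's DenseMatrix.
object SingleRowLayout {
  // Flatten a matrix given as rows into row-major order.
  def rowMajor(rows: Array[Array[Double]]): Array[Double] = rows.flatten

  // Flatten the same matrix into column-major order.
  def colMajor(rows: Array[Array[Double]]): Array[Double] = {
    val numCols = if (rows.isEmpty) 0 else rows.head.length
    (for (j <- 0 until numCols; row <- rows) yield row(j)).toArray
  }
}
```

This is why passing `dm.values` straight to `Vectors.dense` is safe in the binomial case, where there is exactly one row of coefficients.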
[GitHub] spark issue #15091: [Core][Doc]:remove redundant comment
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15091 **[Test build #65351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65351/consoleFull)** for PR 15091 at commit [`c8afcb8`](https://github.com/apache/spark/commit/c8afcb8e51c20157ccd965231141b3b47b3130b6).
[GitHub] spark pull request #15091: [Core][Doc]:remove redundant comment
GitHub user wangmiao1981 opened a pull request:

    https://github.com/apache/spark/pull/15091

[Core][Doc]:remove redundant comment

## What changes were proposed in this pull request?

In the comment, there is a redundant `the estimated`. This PR simply removes the redundant comment and adjusts the format.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangmiao1981/spark comment

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15091.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15091

commit f031bd91acf9c98a06afc9b6aa940248e17a8641
Author: wm...@hotmail.com
Date: 2016-09-14T05:16:32Z

    remove redundant comment
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r78687900

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
         ctx.identifier != null && ctx.identifier.getText.toLowerCase == "noscan") {
       AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
-    } else {
+    } else if (ctx.identifierSeq() == null) {
--- End diff --

Yeah, I'm also thinking of doing this :)
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r7868

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
         ctx.identifier != null && ctx.identifier.getText.toLowerCase == "noscan") {
       AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
-    } else {
+    } else if (ctx.identifierSeq() == null) {
--- End diff --

Then, issue an exception here.
[GitHub] spark issue #14444: [SPARK-16839] [SQL] redundant aliases after cleanupAlias...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/1 cc @shivaram Would it be sensible to print the results when R tests fail?
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r78687294

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
         ctx.identifier != null && ctx.identifier.getText.toLowerCase == "noscan") {
       AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
-    } else {
+    } else if (ctx.identifierSeq() == null) {
--- End diff --

For the analyze column command, users should know exactly what they want to do, so they need to specify the columns; otherwise, we don't compute statistics for columns. AFAIK, Hive will generate all column stats in this case, but I think we should not do that. At the least, we could provide another command such as FOR ALL COLUMNS to do this.
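One way to read the behavior discussed in this thread is sketched below: no column clause means a table-level analysis, an explicit column list means a column-level analysis, and a column clause without any columns is rejected (the exception option raised elsewhere in the thread). Everything here is a hypothetical model; the `Command` types and the `forColumns` encoding are assumptions and do not mirror the real `SparkSqlAstBuilder` or its parser context.

```scala
// Hypothetical model of the dispatch; not the actual Spark parser code.
object AnalyzeDispatch {
  sealed trait Command
  final case class AnalyzeTable(table: String, noscan: Boolean) extends Command
  final case class AnalyzeColumn(table: String, columns: Seq[String]) extends Command

  // forColumns = None: no column clause at all.
  // forColumns = Some(Nil): a column clause without an explicit column list.
  def dispatch(table: String, noscan: Boolean, forColumns: Option[Seq[String]]): Command =
    forColumns match {
      case None => AnalyzeTable(table, noscan)
      case Some(cols) if cols.isEmpty =>
        throw new IllegalArgumentException(
          s"Analyzing columns of $table requires an explicit column list")
      case Some(cols) => AnalyzeColumn(table, cols)
    }
}
```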
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14962#discussion_r78687128

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -457,6 +457,20 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     checkAnswer(df2, df)
   }
 
+
+  test("save as table if a same-name temp view exists") {
+    import SaveMode._
+    for (mode <- Seq(Append, ErrorIfExists, Overwrite, Ignore)) {
+      withTable("same_name") {
+        withTempView("same_name") {
+          spark.range(10).createTempView("same_name")
+          spark.range(20).write.mode(mode).saveAsTable("same_name")
+          checkAnswer(spark.table("same_name"), spark.range(10).toDF())
+          checkAnswer(spark.table("default.same_name"), spark.range(20).toDF())
+        }
+      }
+    }
+  }
--- End diff --

Let's add comments explaining what this test is for, so that we don't accidentally delete it in the future.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78687147 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
    + */
    +case class AnalyzeColumnCommand(
    +    tableName: String,
    +    columnNames: Seq[String]) extends RunnableCommand {
    +
    +  override def run(sparkSession: SparkSession): Seq[Row] = {
    +    val sessionState = sparkSession.sessionState
    +    val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName)
    +    val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
    +
    +    // check correctness for column names
    +    val attributeNames = relation.output.map(_.name.toLowerCase)
    --- End diff --

    Yeah. In Spark, we have the SQLConf `spark.sql.caseSensitive` to control it.
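The case-sensitivity point in this comment can be illustrated with a small, Spark-free sketch. It assumes a plain boolean standing in for the value of `spark.sql.caseSensitive`; the `Resolver` alias, `resolverFor`, and `invalidColumns` are hypothetical names chosen for illustration and are not the code in this PR.

```scala
object ColumnNameCheck {
  // Hypothetical alias: decides whether two names refer to the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // `caseSensitiveAnalysis` stands in for spark.sql.caseSensitive (assumption).
  def resolverFor(caseSensitiveAnalysis: Boolean): Resolver =
    if (caseSensitiveAnalysis) caseSensitive else caseInsensitive

  // Returns the requested columns that match no attribute of the table.
  def invalidColumns(
      requested: Seq[String],
      attributes: Seq[String],
      resolver: Resolver): Seq[String] =
    requested.filterNot(col => attributes.exists(attr => resolver(attr, col)))
}
```

With this shape, `KeY` is accepted against a table with column `key` only when the case-insensitive resolver is selected, instead of always lower-casing both sides as the diff does.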
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14962#discussion_r78687123

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala ---
    @@ -322,6 +325,14 @@ class CatalogSuite
         assert(e2.message == "Cannot create a file-based external data source table without path")
       }
    +  test("dropTempView if a same-name table exists") {
    +    withTable("same_name") {
    +      sql("CREATE TABLE same_name(i int) USING json")
    +      spark.catalog.dropTempView("same_name")
    +      assert(spark.sessionState.catalog.tableExists(TableIdentifier("same_name")))
    +    }
    +  }
    --- End diff --

    Let's add comments to explain what this test is for in case we accidentally delete it in future.
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14962#discussion_r78687075

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -2661,4 +2661,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
             data.selectExpr("`part.col1`", "`col.1`"))
         }
       }
    +
    +  test("CREATE TABLE USING if a same-name temp view exists") {
    +    withTable("same_name") {
    +      withTempView("same_name") {
    +        spark.range(10).createTempView("same_name")
    +        sql("CREATE TABLE same_name(i int) USING json")
    +        checkAnswer(spark.table("same_name"), spark.range(10).toDF())
    +        assert(spark.table("default.same_name").collect().isEmpty)
    +      }
    +    }
    +  }
    --- End diff --

    Let's add comments to explain what this test is for in case we accidentally delete it in future.
[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14981 Merged build finished. Test FAILed.
[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14981 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65345/ Test FAILed.
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 Like Hive, I think we should implement a built-in function, `compute_stats`. Then, the implementation of `AnalyzeColumnCommand` will be much cleaner.
[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14981

    **[Test build #65345 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65345/consoleFull)** for PR 14981 at commit [`07eb037`](https://github.com/apache/spark/commit/07eb0372bbb70eb6a2d661dbdb28750020ba500b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78686975 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
    + */
    +case class AnalyzeColumnCommand(
    +    tableName: String,
    +    columnNames: Seq[String]) extends RunnableCommand {
    +
    +  override def run(sparkSession: SparkSession): Seq[Row] = {
    +    val sessionState = sparkSession.sessionState
    +    val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName)
    +    val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
    +
    +    // check correctness for column names
    +    val attributeNames = relation.output.map(_.name.toLowerCase)
    --- End diff --

    key and KeY are different columns?
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14962#discussion_r78686868

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
    @@ -457,6 +457,20 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
         checkAnswer(df2, df)
       }
    +  test("save as table if a same-name temp view exists") {
    +    import SaveMode._
    +    for (mode <- Seq(Append, ErrorIfExists, Overwrite, Ignore)) {
    +      withTable("same_name") {
    +        withTempView("same_name") {
    +          spark.range(10).createTempView("same_name")
    +          spark.range(20).write.mode(mode).saveAsTable("same_name")
    +          checkAnswer(spark.table("same_name"), spark.range(10).toDF())
    +          checkAnswer(spark.table("default.same_name"), spark.range(20).toDF())
    +        }
    +      }
    +    }
    +  }
    --- End diff --

    This is a regression test.
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14962#discussion_r78686835

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala ---
    @@ -322,6 +325,14 @@ class CatalogSuite
         assert(e2.message == "Cannot create a file-based external data source table without path")
       }
    +  test("dropTempView if a same-name table exists") {
    +    withTable("same_name") {
    +      sql("CREATE TABLE same_name(i int) USING json")
    +      spark.catalog.dropTempView("same_name")
    +      assert(spark.sessionState.catalog.tableExists(TableIdentifier("same_name")))
    +    }
    +  }
    --- End diff --

    This is a regression test.
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14962#discussion_r78686776

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -2661,4 +2661,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
             data.selectExpr("`part.col1`", "`col.1`"))
         }
       }
    +
    +  test("CREATE TABLE USING if a same-name temp view exists") {
    +    withTable("same_name") {
    +      withTempView("same_name") {
    +        spark.range(10).createTempView("same_name")
    +        sql("CREATE TABLE same_name(i int) USING json")
    +        checkAnswer(spark.table("same_name"), spark.range(10).toDF())
    +        assert(spark.table("default.same_name").collect().isEmpty)
    +      }
    +    }
    +  }
    --- End diff --

    This is a regression test.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78686462 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
+ */ +case class AnalyzeColumnCommand( +tableName: String, +columnNames: Seq[String]) extends RunnableCommand { + + override def run(sparkSession: SparkSession): Seq[Row] = { +val sessionState = sparkSession.sessionState +val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName) +val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent)) + +// check correctness for column names +val attributeNames = relation.output.map(_.name.toLowerCase) +val invalidColumns = columnNames.filterNot { col => attributeNames.contains(col.toLowerCase)} +if (invalidColumns.nonEmpty) { + throw new AnalysisException(s"Invalid columns for table $tableName: $invalidColumns.") +} + +relation match { + case catalogRel: CatalogRelation => +updateStats(catalogRel.catalogTable, + AnalyzeTableCommand.calculateTotalSize(sparkSession, catalogRel.catalogTable)) + + case logicalRel: LogicalRelation if logicalRel.catalogTable.isDefined => +updateStats(logicalRel.catalogTable.get, logicalRel.relation.sizeInBytes) + + case otherRelation => +throw new AnalysisException(s"ANALYZE TABLE is not supported for " + + s"${otherRelation.nodeName}.") +} + +def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = { + val lowerCaseNames = columnNames.map(_.toLowerCase) + val attributes = +relation.output.filter(attr => lowerCaseNames.contains(attr.name.toLowerCase)) + + // collect column statistics + val aggColumns = mutable.ArrayBuffer[Column](count(Column("*"))) + attributes.foreach(entry => aggColumns ++= statsAgg(entry.name, entry.dataType)) + val statsRow: InternalRow = Dataset.ofRows(sparkSession, relation).select(aggColumns: _*) +.queryExecution.toRdd.collect().head + + // We also update table-level stats to prevent inconsistency in case of table modification + // between the two ANALYZE commands for collecting table-level stats and column-level stats. 
+ val rowCount = statsRow.getLong(0) + var newStats: Statistics = if (catalogTable.stats.isDefined) { +catalogTable.stats.get.copy(sizeInBytes = newTotalSize, rowCount = Some(rowCount)) + } else { +Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount)) + } + + var pos = 1 + val colStats = mutable.HashMap[String, BasicColStats]() + attributes.foreach { attr => +attr.dataType match { + case n: NumericType => +colStats += attr.name -> BasicColStats( + dataType = attr.dataType, + numNulls = rowCount - statsRow.getLong(pos + NumericStatsAgg.numNotNullsIndex), + max = Option(statsRow.get(pos + NumericStatsAgg.maxIndex,
[GitHub] spark issue #14444: [SPARK-16839] [SQL] redundant aliases after cleanupAlias...
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/1

    I see. It seems using `struct(...)` does not print `struct(...)` but `named_struct(...)` as specified in `CreateNamedStruct`. So, the code below:

    ```scala
    scala> spark.range(1).selectExpr("struct(1, 2)").show()
    ```

    prints below:

    **Before**

    ```bash
    +------------------+
    |struct(col1, col2)|
    +------------------+
    |             [1,2]|
    +------------------+
    ```

    **After**

    ```bash
    +------------------------------+
    |named_struct(col1, 1, col2, 2)|
    +------------------------------+
    |                         [1,2]|
    +------------------------------+
    ```

    Would this be necessary to remove both `CreateStruct` and `CreateStructUnsafe`? I think we might have to introduce common parent if possible.

    BTW, the failed R tests are as below:

    ```r
    df <- createDataFrame(list(list(1L, 2L, 3L), list(4L, 5L, 6L)),
                          schema = c("a", "b", "c"))
    result <- collect(select(df, struct("a", "c")))
    expected <- data.frame(row.names = 1:2)
    expected$"struct(a, c)" <- list(listToStruct(list(a = 1L, c = 3L)),
                                    listToStruct(list(a = 4L, c = 6L)))
    ```

    ```r
    > result
      named_struct(a, a, c, c)
    1                     1, 3
    2                     4, 6
    > expected
      struct(a, c)
    1         1, 3
    2         4, 6
    ```

    ```r
    result <- collect(select(df, struct(df$a, df$b)))
    expected <- data.frame(row.names = 1:2)
    expected$"struct(a, b)" <- list(listToStruct(list(a = 1L, b = 2L)),
                                    listToStruct(list(a = 4L, b = 5L)))
    ```

    ```r
    > result
      named_struct(a, a, b, b)
    1                     1, 2
    2                     4, 5
    > expected
      struct(a, b)
    1         1, 2
    2         4, 5
    ```

    Therefore, it seems we definitely need a test for the names as these holes were identified here.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78685367 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
    + */
    +case class AnalyzeColumnCommand(
    +    tableName: String,
    +    columnNames: Seq[String]) extends RunnableCommand {
    +
    +  override def run(sparkSession: SparkSession): Seq[Row] = {
    +    val sessionState = sparkSession.sessionState
    +    val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName)
    +    val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
    +
    +    // check correctness for column names
    +    val attributeNames = relation.output.map(_.name.toLowerCase)
    +    val invalidColumns = columnNames.filterNot { col => attributeNames.contains(col.toLowerCase)}
    +    if (invalidColumns.nonEmpty) {
    +      throw new AnalysisException(s"Invalid columns for table $tableName: $invalidColumns.")
    +    }
    +
    +    relation match {
    +      case catalogRel: CatalogRelation =>
    +        updateStats(catalogRel.catalogTable,
    +          AnalyzeTableCommand.calculateTotalSize(sparkSession, catalogRel.catalogTable))
    +
    +      case logicalRel: LogicalRelation if logicalRel.catalogTable.isDefined =>
    +        updateStats(logicalRel.catalogTable.get, logicalRel.relation.sizeInBytes)
    +
    +      case otherRelation =>
    +        throw new AnalysisException(s"ANALYZE TABLE is not supported for " +
    --- End diff --

    This `s` is useless. Right?
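To make the remark about the `s` prefix concrete, here is a minimal sketch in plain Scala. The `nodeName` value is a hypothetical stand-in for `otherRelation.nodeName`; only the two string expressions mirror the shape of the line under review.

```scala
object InterpolatorDemo {
  // Hypothetical value standing in for otherRelation.nodeName.
  val nodeName = "LocalRelation"

  // As in the diff: the first literal has no $-placeholder, so its `s` prefix does nothing.
  val withPrefix: String = s"ANALYZE TABLE is not supported for " + s"$nodeName."

  // Same message with the redundant prefix dropped from the first literal.
  val withoutPrefix: String = "ANALYZE TABLE is not supported for " + s"$nodeName."
}
```

Both values are the same string, so dropping the first `s` changes nothing at runtime; it only removes noise.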
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78685328 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
    + */
    +case class AnalyzeColumnCommand(
    +    tableName: String,
    +    columnNames: Seq[String]) extends RunnableCommand {
    +
    +  override def run(sparkSession: SparkSession): Seq[Row] = {
    +    val sessionState = sparkSession.sessionState
    +    val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName)
    +    val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
    +
    +    // check correctness for column names
    +    val attributeNames = relation.output.map(_.name.toLowerCase)
    +    val invalidColumns = columnNames.filterNot { col => attributeNames.contains(col.toLowerCase)}
    --- End diff --

    Also verify whether the list contains duplicate columns.
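One way the duplicate check requested here might look, as a Spark-free sketch. The helper name `duplicateColumns` and the `caseSensitive` flag are assumptions made for illustration; the PR itself may handle this differently.

```scala
object DuplicateColumnCheck {
  // Hypothetical helper: returns each column name that is listed more than once.
  // When caseSensitive is false, names are compared after lower-casing,
  // so "key" and "KeY" count as the same column.
  def duplicateColumns(columnNames: Seq[String], caseSensitive: Boolean): Seq[String] = {
    val normalized = if (caseSensitive) columnNames else columnNames.map(_.toLowerCase)
    normalized
      .groupBy(identity)
      .filter { case (_, occurrences) => occurrences.size > 1 }
      .keys
      .toSeq
      .sorted
  }
}
```

A caller could raise an error whenever the returned sequence is non-empty, in the same place where invalid column names are rejected.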
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78685262 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import scala.collection.mutable + +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.{BasicColStats, Statistics} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + + +/** + * Analyzes the given columns of the given table in the current database to generate statistics, + * which will be used in query optimizations. 
    + */
    +case class AnalyzeColumnCommand(
    +    tableName: String,
    +    columnNames: Seq[String]) extends RunnableCommand {
    +
    +  override def run(sparkSession: SparkSession): Seq[Row] = {
    +    val sessionState = sparkSession.sessionState
    +    val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName)
    +    val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
    +
    +    // check correctness for column names
    +    val attributeNames = relation.output.map(_.name.toLowerCase)
    --- End diff --

    Please consider case sensitivity here.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r78685116

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
           ctx.identifier != null &&
           ctx.identifier.getText.toLowerCase == "noscan") {
           AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
    -    } else {
    +    } else if (ctx.identifierSeq() == null) {
    --- End diff --

    This has a bug. It will jump to this branch, if users input
    ```SQL
    ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS
    ```
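The branch problem described in this comment can be modelled without the ANTLR classes. In the sketch below, `AnalyzeCtx`, `AnalyzeTable`, `AnalyzeColumn`, and `toCommand` are all hypothetical stand-ins, and rejecting an empty `FOR COLUMNS` list is an assumed resolution, not something the review states.

```scala
object AnalyzeDispatch {
  // Hypothetical, simplified stand-in for the parse context; not Spark's real class.
  final case class AnalyzeCtx(
      table: String,
      noscan: Boolean,
      forColumns: Boolean,          // true when FOR COLUMNS was present in the statement
      columns: Option[Seq[String]]) // None plays the role of identifierSeq() == null

  sealed trait Command
  final case class AnalyzeTable(table: String, noscan: Boolean) extends Command
  final case class AnalyzeColumn(table: String, columns: Seq[String]) extends Command

  // Checks FOR COLUMNS first, so a missing column list cannot fall through
  // to the table-level branch.
  def toCommand(ctx: AnalyzeCtx): Either[String, Command] =
    if (ctx.forColumns) {
      ctx.columns.filter(_.nonEmpty) match {
        case Some(cols) => Right(AnalyzeColumn(ctx.table, cols))
        case None => Left(s"FOR COLUMNS needs at least one column name (table ${ctx.table})")
      }
    } else {
      Right(AnalyzeTable(ctx.table, ctx.noscan))
    }
}
```

The point of the sketch is the ordering: testing only whether the column list is absent cannot distinguish a plain table-level `ANALYZE` from a `FOR COLUMNS` clause with nothing after it.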
[GitHub] spark issue #14962: [SPARK-17402][SQL] separate the management of temp views...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14962 Is it possible to first have a PR to fix the bugs?
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r78684701

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
           ctx.identifier != null &&
           ctx.identifier.getText.toLowerCase == "noscan") {
           AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
    -    } else {
    +    } else if (ctx.identifierSeq() == null) {
    --- End diff --

    Since this PR changes the parser, please update the comment of this function to reflect the latest changes. In addition, please add test cases to `DDLCommandSuite` to verify the parser's behavior.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14971 **[Test build #65350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65350/consoleFull)** for PR 14971 at commit [`9e18ba1`](https://github.com/apache/spark/commit/9e18ba104527d2bb14331f4b51194002dabb2556).
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14118 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65343/ Test PASSed.
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14118 Merged build finished. Test PASSed.
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14118 **[Test build #65343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65343/consoleFull)** for PR 14118 at commit [`d5357f9`](https://github.com/apache/spark/commit/d5357f9d784cc277d58fd896738a87a7aff7ba70).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14971 @hvanhovell @cloud-fan Could you help me review this PR? https://github.com/apache/spark/pull/15090 is changing the same code path for column-level statistics. Thanks!
[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14971 retest this please
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15090#discussion_r78683972
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -330,14 +332,237 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
       val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
       dfNoCols.write.format("json").saveAsTable(table_no_cols)
       sql(s"ANALYZE TABLE $table_no_cols COMPUTE STATISTICS")
-      checkStats(
+      checkTableStats(
         table_no_cols, isDataSourceTable = true, hasSizeInBytes = true, expectedRowCounts = Some(10))
     }
   }
+  private def checkColStats(
--- End diff --
This test suite keeps getting bigger. For column stats, could we create a new file?
[GitHub] spark pull request #15026: [SPARK-17472] [PYSPARK] Better error message for ...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15026#discussion_r78683777
--- Diff: python/pyspark/broadcast.py ---
@@ -75,7 +75,13 @@ def __init__(self, sc=None, value=None, pickle_registry=None, path=None):
         self._path = path
     def dump(self, value, f):
-        pickle.dump(value, f, 2)
+        try:
+            pickle.dump(value, f, 2)
+        except pickle.PickleError:
+            raise
+        except Exception as e:
+            msg = "Could not serialize broadcast: " + e.__class__.__name__ + ": " + e.message
+            raise pickle.PicklingError(msg)
--- End diff --
It seems we use print_exec() elsewhere, so I am going to use that for consistency.
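The error-wrapping pattern in the diff above can be tried outside Spark. The following is a minimal standalone sketch under stated assumptions: `dump_with_context` is a hypothetical name, `str(e)` replaces `e.message` (which Python 3 exceptions do not have), and the standard-library `traceback.print_exc` stands in for the `print_exec()` helper mentioned in the comment.

```python
import io
import pickle
import sys
import traceback


def dump_with_context(value, f):
    """Pickle `value` into `f`, re-raising non-pickle errors as PicklingError."""
    try:
        pickle.dump(value, f, 2)
    except pickle.PickleError:
        # Already a pickle error: let it propagate unchanged.
        raise
    except Exception as e:
        # Any other failure gets a message that names the original exception.
        msg = "Could not serialize broadcast: %s: %s" % (e.__class__.__name__, e)
        traceback.print_exc(file=sys.stderr)  # stand-in for print_exec()
        raise pickle.PicklingError(msg)


# Picklable values pass through unchanged.
buf = io.BytesIO()
dump_with_context({"k": 1}, buf)
```

The point of the second `except` clause is that callers only ever need to handle `pickle.PickleError` subclasses, while the printed traceback still shows where the original failure came from.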
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65349/consoleFull)** for PR 15090 at commit [`027bdcc`](https://github.com/apache/spark/commit/027bdcc59b1b01a8dac436dd3a86600c2451c95f).
[GitHub] spark pull request #14962: [SPARK-17402][SQL] separate the management of tem...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14962#discussion_r78683471
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -439,7 +439,7 @@ class Analyzer(
   object ResolveRelations extends Rule[LogicalPlan] {
     private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
       try {
-        catalog.lookupRelation(u.tableIdentifier, u.alias)
+        catalog.lookupTempViewOrRelation(u.tableIdentifier, u.alias)
--- End diff --
This is also used for views, right? Should we just keep the old name?
[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14980 @junyangq As we discussed before, let's open a new PR for 2.0?
[GitHub] spark pull request #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14980
[GitHub] spark issue #15073: [SPARK-17518] [SQL] Block Users to Specify the Internal ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15073 **[Test build #65348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65348/consoleFull)** for PR 15073 at commit [`9711edb`](https://github.com/apache/spark/commit/9711edb25f401703e08e51cc6f4f0495731da12a).
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65347/ Test FAILed.
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65347/consoleFull)** for PR 15090 at commit [`59ae3df`](https://github.com/apache/spark/commit/59ae3dfc45751705962a1370c195c67c6302c376).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class BasicColStats(`
  * `case class AnalyzeColumnCommand(`
  * `trait StatsAggFunc `
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test FAILed.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15085 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65342/ Test PASSed.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15085 Merged build finished. Test PASSed.
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65347 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65347/consoleFull)** for PR 15090 at commit [`59ae3df`](https://github.com/apache/spark/commit/59ae3dfc45751705962a1370c195c67c6302c376).
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15085 **[Test build #65342 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65342/consoleFull)** for PR 15085 at commit [`f60c4be`](https://github.com/apache/spark/commit/f60c4be307cf21bf61b27942ed75887546021458).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/15090

[SPARK-17073] [SQL] generate column-level statistics

## What changes were proposed in this pull request?

Generate basic column statistics for all the atomic types:
- numeric types: max, min, num of nulls, ndv (number of distinct values)
- date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
- string: avg length, max length, num of nulls, ndv
- binary: avg length, max length, num of nulls
- boolean: num of nulls, num of trues, num of falses, ndv (must be 2)

Also support storing and loading these statistics.

## How was this patch tested?

Added unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark colStats

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15090.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #15090

commit 59ae3dfc45751705962a1370c195c67c6302c376
Author: Zhenhua Wang
Date: 2016-09-14T03:03:05Z

support column-level stats
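The per-type statistics listed in the PR description can be sketched in plain Python to make the definitions concrete. This is an illustration only, not the PR's implementation: the function names are hypothetical, `None` stands in for SQL NULL, and ndv is counted exactly here, whereas a real engine might use an approximate count. The boolean ndv, which the description states must be 2, is not computed.

```python
from typing import Any, Dict, Optional, Sequence


def _non_null(values):
    # None plays the role of SQL NULL in this sketch.
    return [v for v in values if v is not None]


def numeric_col_stats(values: Sequence[Optional[float]]) -> Dict[str, Any]:
    """max, min, number of nulls, and exact ndv for a numeric column."""
    present = _non_null(values)
    return {
        "max": max(present) if present else None,
        "min": min(present) if present else None,
        "num_nulls": len(values) - len(present),
        "ndv": len(set(present)),
    }


def string_col_stats(values: Sequence[Optional[str]]) -> Dict[str, Any]:
    """avg length, max length, number of nulls, and exact ndv for a string column."""
    present = _non_null(values)
    lengths = [len(v) for v in present]
    return {
        "avg_len": sum(lengths) / len(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
        "num_nulls": len(values) - len(present),
        "ndv": len(set(present)),
    }


def boolean_col_stats(values: Sequence[Optional[bool]]) -> Dict[str, int]:
    """number of nulls, trues, and falses for a boolean column."""
    present = _non_null(values)
    num_trues = sum(1 for v in present if v is True)
    return {
        "num_nulls": len(values) - len(present),
        "num_trues": num_trues,
        "num_falses": len(present) - num_trues,
    }
```

Date and timestamp columns would reuse `numeric_col_stats` on their internal numeric representation, and binary columns would use the length statistics without ndv, following the list in the description.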
[GitHub] spark issue #15042: [SPARK-17449] [Documentation] [Relation between heartbea...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15042 **[Test build #65346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65346/consoleFull)** for PR 15042 at commit [`1a76a56`](https://github.com/apache/spark/commit/1a76a56c25fd89ff409f856a83c5b1464d153607).
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78682701
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: Model[_]): this.type = {
--- End diff --
+1 on using `KMeansModel.fromCenters(centers)`
[GitHub] spark issue #15059: [SPARK-17506][SQL] Improve the check double values equal...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/15059 Moving generic testing utils from mllib to common looks OK to me. Actually, we have `TestingUtils` under both spark.ml.util and spark.mllib.util. If we move them, we should remove both. Thanks!
[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14980 Thanks @junyangq and @felixcheung - Merging this into master once the AppVeyor check passes
[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14981 **[Test build #65345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65345/consoleFull)** for PR 14981 at commit [`07eb037`](https://github.com/apache/spark/commit/07eb0372bbb70eb6a2d661dbdb28750020ba500b).
[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14980 Merged build finished. Test PASSed.
[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14980 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65344/ Test PASSed.
[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14980 **[Test build #65344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65344/consoleFull)** for PR 14980 at commit [`aa3f6a4`](https://github.com/apache/spark/commit/aa3f6a46fd27d7ad68973cb2426d06e20b6f0b32).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15000: [SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspa...
Github user apetresc commented on the issue: https://github.com/apache/spark/pull/15000 @srowen: Just to make sure I understand, are you asking me to remove the Java accessor here, and just plumb straight through to the Scala object from PySpark? Or is it fine as-is?
[GitHub] spark issue #15035: [SPARK-17477]: SparkSQL cannot handle schema evolution f...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15035 We definitely shouldn't change SpecificMutableRow to do this upcast; otherwise we might introduce subtle bugs with type mismatches in the future. cc @sameeragarwal to see if there is a better place to do this -- I think doing this in Parquet itself is pretty reasonable?
[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14980 **[Test build #65344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65344/consoleFull)** for PR 14980 at commit [`aa3f6a4`](https://github.com/apache/spark/commit/aa3f6a46fd27d7ad68973cb2426d06e20b6f0b32).
[GitHub] spark pull request #14980: [SPARK-17317][SparkR] Add SparkR vignette
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14980#discussion_r78679227 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -385,22 +385,29 @@ head(result[order(result$max_mpg, decreasing = TRUE), ]) Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark. `spark.lapply` works in a manner that is similar to `doParallel` or `lapply` to elements of a list. The results of all the computations should fit in a single machine. If that is not the case you can do something like `df <- createDataFrame(list)` and then use `dapply`. +We use `svm` in package `e1071` as an example. We use all default settings except for varying costs of constraints violation. `spark.lapply` can train those different models in parallel. + ```{r} -families <- c("gaussian", "poisson") -train <- function(family) { - model <- glm(mpg ~ hp, mtcars, family = family) +costs <- exp(seq(from = log(1), to = log(1000), length.out = 5)) --- End diff -- It runs as long as `e1071` is installed on the workers. Perhaps it's better to add a check there?
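The `costs` expression added in the diff above builds a log-spaced grid: five values evenly spaced between log(1) and log(1000), then exponentiated. As a rough check of that arithmetic, here is a Python sketch of the same computation; `log_spaced` is a hypothetical helper, and this is not SparkR code.

```python
import math


def log_spaced(start: float, stop: float, num: int) -> list:
    """Return `num` values from `start` to `stop`, evenly spaced on a log scale."""
    if num < 2:
        raise ValueError("num must be at least 2")
    lo, hi = math.log(start), math.log(stop)
    step = (hi - lo) / (num - 1)
    return [math.exp(lo + i * step) for i in range(num)]


# Same grid as exp(seq(from = log(1), to = log(1000), length.out = 5)) in R.
costs = log_spaced(1, 1000, 5)
```

Each value is the previous one multiplied by 1000^(1/4), so the grid is roughly 1, 5.6, 31.6, 177.8, 1000, and each entry would correspond to one model trained in parallel by `spark.lapply`.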
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15085 **[Test build #65337 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65337/consoleFull)** for PR 15085 at commit [`f69a5ea`](https://github.com/apache/spark/commit/f69a5ea6eff2c6b9f1e07a5d1551c67cdee5ed2e).
* This patch **fails from timeout after a configured wait of \`250m\`**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15085 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65337/ Test FAILed.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15085 Merged build finished. Test FAILed.
[GitHub] spark pull request #14974: [Trivial][ML] Remove unnecessary `new` before cas...
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/14974
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14118 **[Test build #65343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65343/consoleFull)** for PR 14118 at commit [`d5357f9`](https://github.com/apache/spark/commit/d5357f9d784cc277d58fd896738a87a7aff7ba70).
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/14118 @HyukjinKwon thanks for the information! @srowen yea I still think this is good to go.
[GitHub] spark issue #14118: [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV ca...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/14118 Jenkins retest this please
[GitHub] spark issue #15060: [SPARK-17507][ML][MLLib] check weight vector size in ANN
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15060 @srowen by default the `weight` is randomly generated and automatically matches the size; this check is needed only when the user specifies it. The modification here seems to be the only path that gets the user-specified `weight`. If I missed checking something, let me know, thanks!
[GitHub] spark issue #15043: [SPARK-17491] Close serialization stream to fix wrong an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15043 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65341/ Test FAILed.
[GitHub] spark issue #15043: [SPARK-17491] Close serialization stream to fix wrong an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15043 Merged build finished. Test FAILed.
[GitHub] spark issue #15089: [SPARK-15621] [SQL] Support spilling for Python UDF
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15089 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65340/ Test PASSed.
[GitHub] spark issue #15043: [SPARK-17491] Close serialization stream to fix wrong an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15043 **[Test build #65341 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65341/consoleFull)** for PR 15043 at commit [`2f43e69`](https://github.com/apache/spark/commit/2f43e69c69e28ae76364155b9c8a178380b55ff3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15089: [SPARK-15621] [SQL] Support spilling for Python UDF
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15089 Merged build finished. Test PASSed.
[GitHub] spark issue #15089: [SPARK-15621] [SQL] Support spilling for Python UDF
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15089 **[Test build #65340 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65340/consoleFull)** for PR 15089 at commit [`4964b9a`](https://github.com/apache/spark/commit/4964b9a611ed01aaa5252ac642df94db07a38868). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15085 **[Test build #65342 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65342/consoleFull)** for PR 15085 at commit [`f60c4be`](https://github.com/apache/spark/commit/f60c4be307cf21bf61b27942ed75887546021458).
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/14834 Only a couple of minor issues; otherwise, LGTM.
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78674556 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -201,11 +201,24 @@ abstract class ProbabilisticClassificationModel[ probability.argmax } else { val thresholds: Array[Double] = getThresholds - val scaledProbability: Array[Double] = -probability.toArray.zip(thresholds).map { case (p, t) => - if (t == 0.0) Double.PositiveInfinity else p / t + val probabilities = probability.toArray + var argMax = 0 + var max = Double.NegativeInfinity + var i = 0 + while (i < probability.size) { --- End diff -- val length = probability.size
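The suggestion above reads as hoisting the size lookup out of the loop condition. A minimal sketch of that shape, assuming plain arrays rather than Spark's `Vector` type; the object and method names (`ThresholdArgmax`, `argmaxScaled`) are hypothetical and not taken from the PR:

```scala
object ThresholdArgmax {
  // Returns the index i that maximizes probabilities(i) / thresholds(i).
  // A threshold of exactly 0.0 is treated as a scaled value of +Infinity,
  // mirroring the expression in the removed zip/map code of the diff.
  // Ties are resolved in favour of the lowest index; an empty input yields 0.
  def argmaxScaled(probabilities: Array[Double], thresholds: Array[Double]): Int = {
    require(probabilities.length == thresholds.length,
      "probabilities and thresholds must have the same length")
    val length = probabilities.length // read once instead of on every iteration
    var argMax = 0
    var max = Double.NegativeInfinity
    var i = 0
    while (i < length) {
      val t = thresholds(i)
      val scaled = if (t == 0.0) Double.PositiveInfinity else probabilities(i) / t
      if (scaled > max) {
        max = scaled
        argMax = i
      }
      i += 1
    }
    argMax
  }
}
```

Whether hoisting the length makes a measurable difference depends on how the size accessor is implemented; the sketch only shows the form of the change.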
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r7867 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -676,39 +936,54 @@ object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] { private case class Data( numClasses: Int, numFeatures: Int, -intercept: Double, -coefficients: Vector) +interceptVector: Vector, +coefficientMatrix: Matrix, +isMultinomial: Boolean) override protected def saveImpl(path: String): Unit = { // Save metadata and Params DefaultParamsWriter.saveMetadata(instance, path, sc) // Save model data: numClasses, numFeatures, intercept, coefficients - val data = Data(instance.numClasses, instance.numFeatures, instance.intercept, -instance.coefficients) + val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector, +instance.coefficientMatrix, instance.isMultinomial) val dataPath = new Path(path, "data").toString sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) } } - private class LogisticRegressionModelReader -extends MLReader[LogisticRegressionModel] { + private class LogisticRegressionModelReader extends MLReader[LogisticRegressionModel] { /** Checked against metadata when loading model */ private val className = classOf[LogisticRegressionModel].getName override def load(path: String): LogisticRegressionModel = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) + val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r + val versionRegex(major, minor, _) = metadata.sparkVersion val dataPath = new Path(path, "data").toString val data = sparkSession.read.format("parquet").load(dataPath) - // We will need numClasses, numFeatures in the future for multinomial logreg support. - // TODO: remove numClasses and numFeatures fields? 
- val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) = -MLUtils.convertVectorColumnsToML(data, "coefficients") - .select("numClasses", "numFeatures", "intercept", "coefficients") - .head() - val model = new LogisticRegressionModel(metadata.uid, coefficients, intercept) + val model = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)) { --- End diff -- How about `2.0.1`?
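To make the question concrete, here is a small standalone sketch of the version check as written in the diff. The regex and the condition are copied from the diff; the wrapper names (`VersionCheck`, `majorMinor`, `usesOldFormat`) are hypothetical, and the sketch makes no claim about which on-disk format `2.0.1` should map to:

```scala
object VersionCheck {
  // Same pattern as in the diff: major.minor.rest
  private val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r

  // Extracts (major, minor) from a version string such as "2.1.0-SNAPSHOT".
  def majorMinor(sparkVersion: String): (Int, Int) = sparkVersion match {
    case versionRegex(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(s"Unrecognized version string: $sparkVersion")
  }

  // Mirrors the condition in the diff: everything before 2.1 takes the old-format branch.
  def usesOldFormat(sparkVersion: String): Boolean = {
    val (major, minor) = majorMinor(sparkVersion)
    major < 2 || (major == 2 && minor == 0)
  }
}
```

Under this condition every `2.0.x` release, including `2.0.1`, takes the old-format branch because the patch component is ignored. That is only correct if the new format never ships in a `2.0.x` maintenance release.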
[GitHub] spark pull request #15085: [SPARK-17484] Prevent invalid block locations fro...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/15085#discussion_r78674398 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -857,9 +862,11 @@ private[spark] class BlockManager( val startTimeMs = System.currentTimeMillis var blockWasSuccessfullyStored: Boolean = false +var exceptionWasThrown: Boolean = true val result: Option[T] = try { val res = putBody(putBlockInfo) blockWasSuccessfullyStored = res.isEmpty + exceptionWasThrown = false res } finally { if (blockWasSuccessfullyStored) { --- End diff -- That said, I think we could simplify this by moving the non-error-case code into the `try` block. Let me do that now.
[GitHub] spark pull request #15085: [SPARK-17484] Prevent invalid block locations fro...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/15085#discussion_r78674369 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -857,9 +862,11 @@ private[spark] class BlockManager( val startTimeMs = System.currentTimeMillis var blockWasSuccessfullyStored: Boolean = false +var exceptionWasThrown: Boolean = true val result: Option[T] = try { val res = putBody(putBlockInfo) blockWasSuccessfullyStored = res.isEmpty + exceptionWasThrown = false res } finally { if (blockWasSuccessfullyStored) { --- End diff -- One concern with using a `catch` here is handling of `InterruptedException`: if we use `case NonFatal(e)` that won't match `InterruptedException` and we'll miss out on cleanup following that. If we catch `Throwable`, on the other hand, then I think that we'll end up clearing the `isInterrupted` bit for `InterruptedException`s and it'll be awkward to match and re-set it when rethrowing. Therefore I'd like to keep the exception-handling case in the `finally` block with a simple check to see if we entered that block via an error case. Note that I've seen this same exception-handling idiom used in Java code, where code that catches and re-throws `Throwable` won't compile in older Java versions because of the checked exception-handling (I think that newer versions are a bit more permissive about throwing exceptions from a `catch` block).
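The idiom described above can be sketched in isolation. This is a generic illustration, not the `BlockManager` code: `CleanupOnFailure`, `runWithCleanup`, and the `cleanup` callback are hypothetical names. The flag starts as `true` and is flipped only after the body returns normally, so the `finally` block can tell whether it was entered through an error without catching anything:

```scala
object CleanupOnFailure {
  // Runs `body`; if it exits by throwing (any Throwable, including
  // InterruptedException), runs `cleanup` and lets the original exception
  // propagate untouched. Nothing is caught or rethrown.
  def runWithCleanup[T](body: => T)(cleanup: () => Unit): T = {
    var exceptionWasThrown = true
    try {
      val res = body
      exceptionWasThrown = false // reached only when body returned normally
      res
    } finally {
      if (exceptionWasThrown) {
        cleanup()
      }
    }
  }
}
```

Because no `catch` clause is involved, the question of which exception types to match, `NonFatal` or `Throwable`, does not arise, which appears to be the motivation given in the comment.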
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78674092 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -595,55 +831,104 @@ class LogisticRegressionModel private[spark] ( * Predict label for the given feature vector. * The behavior of this can be adjusted using [[thresholds]]. */ - override protected def predict(features: Vector): Double = { + override protected def predict(features: Vector): Double = if (isMultinomial) { +super.predict(features) --- End diff -- maybe we want to have the specialized version when thresholds is not defined?
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78673689 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -508,11 +680,42 @@ object LogisticRegression extends DefaultParamsReadable[LogisticRegression] { @Since("1.4.0") class LogisticRegressionModel private[spark] ( @Since("1.4.0") override val uid: String, -@Since("2.0.0") val coefficients: Vector, -@Since("1.3.0") val intercept: Double) +@Since("2.1.0") val coefficientMatrix: Matrix, +@Since("2.1.0") val interceptVector: Vector, +@Since("1.3.0") override val numClasses: Int, +private val isMultinomial: Boolean) extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel] with LogisticRegressionParams with MLWritable { + @Since("2.0.0") + def coefficients: Vector = if (isMultinomial) { +throw new SparkException("Multinomial models contain a matrix of coefficients, use " + + "coefficientMatrix instead.") + } else { +_coefficients + } + + // convert to appropriate vector representation without replicating data + private lazy val _coefficients: Vector = coefficientMatrix match { +case dm: DenseMatrix => Vectors.dense(dm.values) --- End diff -- I think you need to check `coefficientMatrix.isTransposed` even if it's dense here.
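A sketch of why the flag matters when reading a dense matrix's backing array directly. `DenseMat` below is a hypothetical stand-in written for this example, not Spark's `DenseMatrix`; it assumes the same convention, where `values` is column-major unless `isTransposed` is set, in which case it is row-major:

```scala
// Hypothetical stand-in for a dense matrix with a layout flag.
final case class DenseMat(
    numRows: Int,
    numCols: Int,
    values: Array[Double],
    isTransposed: Boolean) {
  // Element lookup that honours the storage layout.
  def apply(i: Int, j: Int): Double =
    if (isTransposed) values(i * numCols + j) else values(i + j * numRows)
}

object Flatten {
  // Returns the entries in column-major order regardless of storage layout.
  // The backing array is reused when it is already column-major.
  def toColumnMajor(m: DenseMat): Array[Double] =
    if (!m.isTransposed) {
      m.values
    } else {
      val out = new Array[Double](m.numRows * m.numCols)
      var j = 0
      while (j < m.numCols) {
        var i = 0
        while (i < m.numRows) {
          out(i + j * m.numRows) = m(i, j)
          i += 1
        }
        j += 1
      }
      out
    }
}
```

For a matrix with a single row the two layouts coincide, so reading `values` directly happens to give the same order in that case; the explicit check guards the general case.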
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15085 Merged build finished. Test PASSed.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15085 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65339/ Test PASSed.
[GitHub] spark issue #15085: [SPARK-17484] Prevent invalid block locations from being...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15085 **[Test build #65339 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65339/consoleFull)** for PR 15085 at commit [`8ab3108`](https://github.com/apache/spark/commit/8ab3108569e5812e0e81b77e3dfb0be1f7e557ce). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14691: [SPARK-16407][STREAMING] Allow users to supply custom st...
Github user jodersky commented on the issue: https://github.com/apache/spark/pull/14691 I like the idea! This might not be the best place to start a discussion, but I reckon that the sink provider API could also eventually be used to provision built-in sinks. It would make the current, stringly-typed API optional and provide more compile-time safety.
[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65336/ Test PASSed.