[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Sorry, I gave a wrong answer at the beginning. Next time, I will review it more carefully before leaving a comment. Thank you for your work! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14660: [SPARK-17071][SQL] Fetch Parquet schema without another ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14660 **[Test build #63828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63828/consoleFull)** for PR 14660 at commit [`e1214d5`](https://github.com/apache/spark/commit/e1214d50035441fb96551683cf38ae3e49f07b7d).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 :) I thought about this issue again. At this stage, could you make a PR for this? I think you're the best person to do that. You made this optimizer and found the correct fix. This was a nice chance for me to investigate this optimizer and nullability propagation. Thank you for reviewing this, @gatorsmile. I'll close this PR soon.
[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14660 [SPARK-17071][SQL] Fetch Parquet schema without another Spark job when it is a single file to touch

## What changes were proposed in this pull request?

It seems Spark always executes another job to figure out the schema ([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)). However, this seems to be a bit of overhead when there is only a single file to touch. I ran a benchmark with the code below:

```scala
test("Benchmark for JSON writer") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)
    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

```
Parquet - read schema:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
Parquet - read schema                 47 /  49           0.0     46728419.0        1.0X
```

- **After**

```
Parquet - read schema:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
Parquet - read schema                  2 /   3           0.0      1811673.0        1.0X
```

It seems this became about 20X faster (although it is a small portion of the total job run-time). As a reference, ORC does this on the driver side; see [OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83).

## How was this patch tested?
Existing tests should cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14660

commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon
Date: 2016-08-16T05:42:29Z

    Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon
Date: 2016-08-16T05:46:12Z

    Fix modifier
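The `Benchmark` class used in the PR description above is Spark's internal micro-benchmark helper. The measurement pattern it follows (run a body N times, report best and average elapsed time) can be sketched in plain Scala; this is a hypothetical, simplified harness for illustration, not Spark's actual implementation:

```scala
// Minimal best/average timing harness, in the spirit of Spark's internal
// Benchmark utility (hypothetical simplification; no warm-up, no rate column).
object MicroBench {
  // Run `body` `iters` times; return (best, average) elapsed time in nanoseconds.
  def measure(iters: Int)(body: => Unit): (Long, Long) = {
    require(iters > 0, "need at least one iteration")
    val times = (1 to iters).map { _ =>
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }
    (times.min, times.sum / times.length)
  }
}

// Example: time a trivial computation ten times, as the PR's benchmark
// does for schema reads.
val (best, avg) = MicroBench.measure(10) { (1 to 10000).sum }
```

The "Before"/"After" numbers in the PR body are exactly such best/average pairs, so a ~20X change in the best time is what the comparison reports.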
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/13796 @sethah Thank you for the great work. I'll make another pass tomorrow.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 One more try:

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++ filter.constraints
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```

Does this have a hole?
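The `canFilterOutNull` check discussed here rests on SQL's three-valued logic: a predicate can eliminate the null-extended rows an outer join produces only if it cannot evaluate to TRUE when the attribute is NULL. A minimal, self-contained sketch of that idea (a hypothetical simplified model using `Option` for SQL NULL, not Catalyst's actual `Expression` API):

```scala
// Three-valued SQL truth values; Unknown models a NULL result.
sealed trait Tri
case object True extends Tri
case object False extends Tri
case object Unknown extends Tri

// IS NOT NULL evaluates to FALSE (not UNKNOWN) on a NULL input.
def isNotNull(v: Option[Int]): Tri = if (v.isDefined) True else False

// A comparison against NULL evaluates to UNKNOWN.
def gt(lit: Int)(v: Option[Int]): Tri =
  v.map(x => if (x > lit) True else False).getOrElse(Unknown)

// A condition filters out null-extended rows iff it is not TRUE on NULL input.
def canFilterOutNull(pred: Option[Int] => Tri): Boolean = pred(None) != True
```

Under this model both `IsNotNull(a)` and `a > 3` filter out null-extended rows, which is why adding the filter's `IsNotNull` constraints to the condition list strengthens the check without changing its semantics.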
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63821/ Test PASSed.
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616 Merged build finished. Test PASSed.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Another version. : )

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++ filter.constraints.filter(_.isInstanceOf[IsNotNull])
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616 **[Test build #63821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63827/ Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Merged build finished. Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 How about another version?

```scala
val leftConditions = (splitConjunctiveConditions ++ filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = (splitConjunctiveConditions ++ filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Oh, that would be a perfect fix.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63826/ Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 How about this fix?

```scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.left.outputSet) && canFilterOutNull(expr))
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.right.outputSet) && canFilterOutNull(expr))
```
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13796 @dbtsai Thanks for taking the time to review this! Major items right now:
* Adding the derivation to the aggregator doc (this is mostly finished, just fighting Scala doc with LaTeX)
* Deciding whether to add an initial model and tests in this PR or as a follow-up
* Refactoring the logistic regression helper classes to a separate file

Let me know if you see anything else.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63825/ Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392 Merged build finished. Test PASSed.
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r74876946

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,626 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+    with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63824/ Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63822/ Test PASSed.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Merged build finished. Test PASSed.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 A better fix is to use `nullable` in `Expression` when collecting the `IsNotNull` constraints: `filter.constraints.filter(_.isInstanceOf[IsNotNull])`
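To illustrate the idea (a hedged sketch with simplified stand-ins, not Spark's actual `Expression` classes; `Attr` and the toy `Coalesce` here are invented for illustration): a nullability-aware check matters because `Coalesce` is nullable only when *every* child is nullable, so `IsNotNull(coalesce(b, c))` must not be misread as a per-column non-null guarantee.

```scala
// Minimal stand-in for an expression tree with nullability (illustrative only).
sealed trait Expr { def nullable: Boolean; def references: Set[String] }

case class Attr(name: String, nullable: Boolean) extends Expr {
  def references: Set[String] = Set(name)
}

case class Coalesce(children: Seq[Expr]) extends Expr {
  // coalesce(...) is NULL only when all of its children are NULL
  def nullable: Boolean = children.forall(_.nullable)
  def references: Set[String] = children.flatMap(_.references).toSet
}

val b = Attr("b", nullable = true)
val c = Attr("c", nullable = true)

// IsNotNull(coalesce(b, c)) references both b and c, yet it guarantees
// nothing about either column individually: the expression stays nullable
// as long as every child is nullable.
val expr = Coalesce(Seq(b, c))
println(expr.nullable)    // true: each child may still be NULL on its own
println(expr.references)
```

A reference-intersection test on such a constraint would "see" both columns, which is exactly the over-approximation discussed in this thread.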
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 `canFilterOutNull` will cover almost all the cases. Sorry, I did not read the plan until you asked me to write a test case. Then I realized that the implementation of natural/using joins just uses `coalesce`. As @hvanhovell and @nsyca said, it is just syntactic sugar.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36).
[GitHub] spark pull request #14506: [SPARK-16916][SQL] serde/storage properties shoul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14506
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14659 Can one of the admins verify this patch?
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Please let me think more on this issue.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
GitHub user Sherry302 opened a pull request: https://github.com/apache/spark/pull/14659 [SPARK-16757] Set up Spark caller context to HDFS ## What changes were proposed in this pull request? 1. Pass `jobId` to Task. 2. Invoke Hadoop APIs. A new function `setCallerContext` is added in `Utils`. It invokes the APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log`. For applications in Yarn client mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the Yarn `Client`. For applications in Yarn cluster mode, it is called in `Task` and the `ApplicationMaster`. The Spark caller contexts written into `hdfs-audit.log` are the application's name `{spark.app.name}` and `JobID_stageID_stageAttemptId_taskID_attemptNumber`. ## How was this patch tested? Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode, checking that the Spark caller contexts are written into HDFS `hdfs-audit.log` successfully. For example, run SparkKMeans in Yarn client mode: `./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5` Before: there is no Spark caller context in the records of `hdfs-audit.log`. After: Spark caller contexts appear in the records of `hdfs-audit.log`.
(_Note: the Spark caller context below appears because the Hadoop caller context API was invoked in the Yarn Client_) `2016-07-21 13:52:30,802 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SparkKMeans running on Spark` (_Note: the Spark caller context below appears because the Hadoop caller context API was invoked in a Task_) `2016-07-21 13:52:35,584 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0` You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sherry302/spark callercontextSubmit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14659.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14659 commit ec6833d32ef14950b2d81790bc908992f6288815 Author: Weiqing Yang Date: 2016-08-16T04:11:41Z [SPARK-16757] Set up Spark caller context to HDFS
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Yep, I agree. `Expr` could be anything. However, this will greatly reduce the scope of this optimization. Is that okay with you?
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Merged build finished. Test PASSed.
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63818/ Test PASSed.
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14506 Thanks. Merging to master.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 If that is not applicable, I agree with @gatorsmile .
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 That only resolves one specific case. The expressions could be much more complex; `Coalesce` can appear at a very deep layer of the expression tree.
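To see why a top-level type test is not enough: a constraint such as `IsNotNull(Abs(Coalesce(b, c)))` hides the `Coalesce` one level down, so `expr.isInstanceOf[Coalesce]` on the constraint's child misses it. A recursive scan would be needed (a hedged sketch over a toy expression tree; `Attr` and `Abs` here are invented stand-ins, not Spark's `TreeNode` API):

```scala
// Toy expression tree (illustrative stand-ins, not Spark classes).
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class Abs(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }
case class Coalesce(args: Seq[Expr]) extends Expr { def children: Seq[Expr] = args }

// True if any node in the tree is a Coalesce, however deeply nested.
def containsCoalesce(e: Expr): Boolean =
  e.isInstanceOf[Coalesce] || e.children.exists(containsCoalesce)

val hidden = Abs(Coalesce(Seq(Attr("b"), Attr("c"))))
// A shallow check misses the nested Coalesce; the recursive one finds it.
println(hidden.isInstanceOf[Coalesce])  // false
println(containsCoalesce(hidden))       // true
```

Even this scan only handles `Coalesce` specifically; other null-tolerant expressions would need the same treatment, which is gatorsmile's point about complexity.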
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14506 **[Test build #63818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63818/consoleFull)** for PR 14506 at commit [`3042af2`](https://github.com/apache/spark/commit/3042af2f0e9ae82e40d14e950a1036b9e417dbc9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 What about this, if we could exclude those functions?
```scala
 val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      leftOuterAttributeSet.intersect(expr.references).nonEmpty)
 val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      rightOuterAttributeSet.intersect(expr.references).nonEmpty)
```
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63820/ Test PASSed.
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Merged build finished. Test PASSed.
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874929
--- Diff: R/pkg/R/functions.R ---
@@ -1143,7 +1139,7 @@ setMethod("minute",
 #' @export
 #' @examples \dontrun{select(df, monotonically_increasing_id())}
 setMethod("monotonically_increasing_id",
-          signature(x = "missing"),
+          signature(),
--- End diff --
Automatic generation of S4 methods is not desirable. I hope this case can be handled better by roxygen. For now, I agree that (b) is a good solution.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 The right fix is to change the following statements
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
to the following ones:
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Sorry, my above description was not clear. `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL` values of `b#227` and `c#238` individually: `coalesce(b#227, c#238)` returns `NULL` only when both `b#227` and `c#238` are `NULL`. Thus, we are unable to use the following two statements to conclude whether the left or right side has non-null predicates.
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```
and
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
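A quick way to see this semantics (a hedged model, not Spark's implementation: SQL NULLs represented as Scala `Option`, `coalesce` as first-defined-wins):

```scala
// coalesce over nullable values, modeled with Option:
// returns the first non-NULL argument, or NULL if all are NULL.
def coalesce[A](xs: Option[A]*): Option[A] = xs.flatten.headOption

val b: Option[Int] = None    // b IS NULL
val c: Option[Int] = Some(1) // c IS NOT NULL

// The row survives a filter isnotnull(coalesce(b, c)) even though b IS NULL,
// so the constraint implies nothing about b (or c) alone.
println(coalesce(b, c).isDefined)        // true
// Only when every argument is NULL does coalesce return NULL.
println(coalesce(None, None).isDefined)  // false
```

This is exactly why an `IsNotNull` constraint over a `coalesce` must not be treated as a null-filtering predicate on its referenced columns.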
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14580 Can you explain `isnotnull(coalesce(b#227, c#238)) does not filter out NULL!!!`?
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Btw, to give back-of-the-envelope estimates, we can look at 2 numbers: (1) How many nodes will be split on each iteration? (2) How big is the forest which is serialized and sent to workers on each iteration?
For (1), here's an example:
* 1000 features, each with 50 bins -> ~50 possible splits
* set maxMemoryInMB = 256 (default)
* regression => 3 Double values per possible split
* 256 * 10^6 / (1000 * 50 * 3 * 8) ≈ 213 nodes/iteration
This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees. I.e., we should avoid communicating most trees.
For (2), the forest can be pretty expensive to send:
* Each node:
  * leaf node: 5 Doubles
  * internal node: ~8 Doubles/references + Split
  * Split: O(# categories) or 2 values for continuous, say 3 Doubles on average
  * => say 8 Doubles/node on average
* 100 trees of depth 8 => 25600 nodes => 1.6MB
* 100 trees of depth 14 => 105MB
* I've heard of many cases of users wanting to fit 500-1000 trees and use trees of depth 18-20.
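These estimates can be reproduced mechanically (a sketch under the assumed interpretation of the figures above: per node, 1000 features x 50 candidate splits x 3 Doubles x 8 bytes of sufficient statistics, and roughly 2^depth nodes per tree at ~8 Doubles each; the helper names are invented here):

```scala
// Nodes whose split statistics fit in the memory budget per iteration.
val maxMemoryBytes = 256L * 1000 * 1000      // maxMemoryInMB = 256
val statsPerNode   = 1000L * 50 * 3 * 8      // features * splits * Doubles * bytes
val nodesPerIter   = maxMemoryBytes / statsPerNode
println(nodesPerIter)                        // 213

// Serialized forest size: ~2^depth nodes per tree, ~8 Doubles (64 bytes) per node.
def forestBytes(numTrees: Int, depth: Int): Long =
  numTrees.toLong * (1L << depth) * 8 * 8

println(forestBytes(100, 8) / 1e6)           // ~1.6 MB
println(forestBytes(100, 14) / 1e6)          // ~105 MB
```

At depth 14 the forest alone is two orders of magnitude larger than the per-iteration split budget, which is the communication cost this PR targets.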
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b).
[GitHub] spark issue #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14658 **[Test build #63823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63823/consoleFull)** for PR 14658 at commit [`443aa91`](https://github.com/apache/spark/commit/443aa91cfc2490be9733c78b7cd911f09bedfac6).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580
```Scala
val df12 = df1.join(df2, $"df1.a" === $"df2.a", "fullouter")
  .select(coalesce($"df1.b", $"df2.c").as("a"), $"df1.b", $"df2.c")
df12.join(df3, "a").explain(true)
```
This is an example to show that we should not eliminate the outer join, even if `isnotnull(coalesce(b#227, c#238))` contains attributes that are not in the join conditions.
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874081
--- Diff: R/pkg/R/SQLContext.R ---
@@ -181,7 +181,7 @@ getDefaultSqlSource <- function() {
 #' @method createDataFrame default
 #' @note createDataFrame since 1.4.0
 # TODO(davies): support sampling and infer type from NA
-createDataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) {
+createDataFrame.default <- function(data, schema = NULL) {
--- End diff --
Oh yes... Thanks!
[GitHub] spark pull request #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/14658 [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more than 2 GB ## What changes were proposed in this pull request? Add a class `ChunkFetchInputStream`, which has the following effects: 1. flow control [WIP] 2. reduced memory usage [WIP] 3. unlimited size [WIP] ## How was this patch tested? WIP You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/spark SPARK-5928_Shuffle_Blocks_2G Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14658.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14658 commit 443aa91cfc2490be9733c78b7cd911f09bedfac6 Author: Guoqiang Li Date: 2016-08-16T04:00:10Z Remote Shuffle Blocks cannot be more than 2 GB
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74873932
--- Diff: R/pkg/R/generics.R ---
@@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 #' @export
 setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
+#' @rdname spark.gaussianMixture
+#' @export
+setGeneric("spark.gaussianMixture",
+           function(data, formula, ...) {
+             standardGeneric("spark.gaussianMixture")
--- End diff --
It cannot fit on one line, since `lint-r` requires that lines be no more than 100 characters.
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74873867
--- Diff: R/pkg/R/mllib.R ---
@@ -298,14 +304,15 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
 #' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
 #' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
 #'
-#' @param data SparkDataFrame for training
-#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
 #'operators are supported, including '~', '.', ':', '+', and '-'.
 #'Note that the response variable of formula is empty in spark.kmeans.
-#' @param k Number of centers
-#' @param maxIter Maximum iteration number
-#' @param initMode The initialization algorithm choosen to fit the model
-#' @return \code{spark.kmeans} returns a fitted k-means model
+#' @param ... additional argument(s) passed to the method.
--- End diff --
Yeah agreed.
[GitHub] spark issue #14628: [SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14628 @holdenk I think a depth of 2 is enough to handle large RDDs, and a bigger depth may add cost. I'll append the test results later. Thanks!
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 None of us is right. :( `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL`!!! Thus, the right fix is to remove the second condition:

```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```

and

```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
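To see why a predicate like `isnotnull(coalesce(b, c))` cannot be treated as an `IsNotNull` constraint on either join side alone, here is a plain-Python sketch of the SQL semantics, modeling NULL as `None` (this is illustrative only, not Spark code):

```python
def coalesce(*args):
    # SQL COALESCE: return the first non-NULL argument, or NULL if all are NULL.
    for a in args:
        if a is not None:
            return a
    return None

# (A.B, B.C) pairs as a full outer join might produce them: each row has at
# least one non-NULL side, so the filter keeps every row even though A.B
# (and B.C) is NULL in some of them. The predicate therefore never forces
# one side alone to be non-null, and must not justify eliminating the
# outer join.
rows = [(2, None), (None, 4), (3, 5)]
kept = [r for r in rows if coalesce(*r) is not None]
```

Here `kept` equals `rows`: no row is filtered out, which is exactly the case the optimizer was getting wrong.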
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Sorry for the long delay; I've been swamped by other things for a while. Re-emerging... I switched to Stack and then realized Stack has been deprecated in Scala 2.11, so I reverted to the original NodeQueue. But I renamed NodeQueue to NodeStack to be a bit clearer. @hhbyyh Any luck testing this at scale?
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 I found the root cause. None of us is right. :(
[GitHub] spark issue #14647: [WIP][Test only][DEMO][SPARK-6235]Address various 2G lim...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/14647 @hvanhovell I will submit some small PRs and provide a higher-level description of them.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 You are right. I missed that `UnsafeArrayData` is a subclass of `ArrayData`. We can pass `UnsafeArrayData` to a projection.

I have one question. When we directly generate `UnsafeArrayData` from a primitive array and copy it into an `InternalRow` (`serializefromobject_result`), the following two operations are required:

1. Copy from a primitive array to `UnsafeArrayData`
2. Copy from `UnsafeArrayData` into `InternalRow` at line 102

On the other hand, this PR requires only one operation:

0. (No copy happens at line 086 since this PR just stores a reference to a primitive array in `GenericArrayData`)
1. Copy from a primitive array to `InternalRow` ([this PR](https://github.com/apache/spark/pull/13911) performs `Platform.copyMemory` without any iteration)

Can we avoid the additional copy in step 2 when we directly generate `UnsafeArrayData` from a primitive array?
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616 **[Test build #63821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399).
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14649 Also, if my understanding is correct, we are picking up only a single file to read the footer (see [ParquetFileFormat.scala#L217-L225](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L217-L225)) unless we merge schemas. So it seems that, for this reason, writing `_metadata` or `_common_metadata` is disabled (see https://issues.apache.org/jira/browse/SPARK-15719).
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872775

```
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -423,6 +425,54 @@ class ParquetFileFormat
       sqlContext.sessionState.newHadoopConf(),
       options)
   }
+
+  override def filterPartitions(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      allFiles: Seq[FileStatus],
+      root: Path,
+      partitions: Seq[Partition]): Seq[Partition] = {
+    // Read the "_metadata" file if available, contains all block headers. On S3 better to grab
+    // all of the footers in a batch rather than having to read every single file just to get its
+    // footer.
+    allFiles.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE).map { stat =>
+      val metadata = ParquetFileReader.readFooter(conf, stat, ParquetMetadataConverter.NO_FILTER)
+      partitions.map { part =>
+        filterByMetadata(
+          filters,
+          schema,
+          conf,
+          root,
+          metadata,
+          part)
+      }.filterNot(_.files.isEmpty)
+    }.getOrElse(partitions)
+  }
+
+  private def filterByMetadata(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      root: Path,
+      metadata: ParquetMetadata,
+      partition: Partition): Partition = {
+    val blockMetadatas = metadata.getBlocks.asScala
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val conjunctiveFilter = filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduceOption(FilterApi.and)
+    conjunctiveFilter.map { conjunction =>
+      val filteredBlocks = RowGroupFilter.filterRowGroups(
```

--- End diff --

Do you mind if I ask a question, please? If my understanding is correct, Parquet already filters row groups in both the normal reader and the vectorized reader (https://github.com/apache/spark/pull/13701). Is this doing the same thing on the Spark side?
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872795 (commenting on the same `filterPartitions`/`filterByMetadata` diff hunk in ParquetFileFormat.scala quoted above)

--- End diff --

Also, doesn't this try to touch many files on the driver side?
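As background for the row-group filtering questions: Parquet can skip a whole row group when the min/max statistics in the footer show the filter cannot match. A minimal Python sketch of that idea, assuming made-up stats for a single integer column (the dicts below are hypothetical, not the real footer API):

```python
# Hypothetical per-row-group statistics; in reality these come from the
# ParquetMetadata footer (block metadata), here modeled for one column "x".
row_groups = [
    {"file": "part-0", "min": 0,  "max": 9},
    {"file": "part-1", "min": 10, "max": 19},
    {"file": "part-2", "min": 20, "max": 29},
]

def may_match_eq(rg, value):
    # An equality filter x = value can only match rows in a group whose
    # [min, max] range contains the value; other groups are skipped
    # without reading any data pages.
    return rg["min"] <= value <= rg["max"]

kept = [rg["file"] for rg in row_groups if may_match_eq(rg, 15)]
```

With a filter `x = 15`, only `part-1` survives. The PR under review applies this pruning on the driver using the consolidated `_metadata` file, which is why the reviewer asks whether it duplicates the per-reader row-group filtering Parquet already does.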
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238).
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14626 Merged build finished. Test PASSed.
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63819/
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14626 **[Test build #63819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63819/consoleFull)** for PR 14626 at commit [`2723eca`](https://github.com/apache/spark/commit/2723ecadcec4baad697639023fba6aafa373f7d6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74871845

```
--- Diff: R/pkg/R/mllib.R ---
@@ -414,6 +421,94 @@ setMethod("predict", signature(object = "KMeansModel"),
             return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
           })
 
+#' Multilayer Perceptron Classification Model
+#'
+#' \code{spark.mlp} fits a multi-layer perceptron neural network model against a SparkDataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
+#' Only categorical data is supported.
+#' For more details, see
+#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html
+#' #multilayer-perceptron-classifier}{Multilayerperceptron classifier}.
+#'
+#' @param data A \code{SparkDataFrame} of observations and labels for model fitting
+#' @param blockSize BlockSize parameter
+#' @param layers Integer vector containing the number of nodes for each layer
+#' @param solver Solver parameter, supported options: "gd" (minibatch gradient descent) or "l-bfgs"
+#' @param maxIter Maximum iteration number
+#' @param tol Convergence tolerance of iterations
+#' @param stepSize StepSize parameter
+#' @param seed Seed parameter for weights initialization
+#' @return \code{spark.mlp} returns a fitted Multilayer Perceptron Classification Model
+#' @rdname spark.mlp
+#' @aliases spark.mlp,SparkDataFrame-method
+#' @name spark.mlp
+#' @seealso \link{read.ml}
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
+#'
+#' # fit a Multilayer Perceptron Classification Model
+#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs",
+#'                    maxIter = 100, tol = 0.5, stepSize = 1, seed = 1)
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.mlp since 2.1.0
+setMethod("spark.mlp", signature(data = "SparkDataFrame"),
+          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
+                   tol = 0.5, stepSize = 1, seed = 1, ...) {
+            jobj <- callJStatic("org.apache.spark.ml.r.MultilayerPerceptronClassifierWrapper",
+                                "fit", data@sdf, as.integer(blockSize), as.array(layers),
+                                as.character(solver), as.integer(maxIter), as.numeric(tol),
+                                as.numeric(stepSize), as.integer(seed))
+            return(new("MultilayerPerceptronClassificationModel", jobj = jobj))
+          })
+
+# Makes predictions from a model produced by spark.mlp().
+
+#' @param newData A SparkDataFrame for testing
+#' @return \code{predict} returns a SparkDataFrame containing predicted labeled in a column named
+#'         "prediction"
+#' @rdname spark.mlp
+#' @aliases predict,MultilayerPerceptronClassificationModel-method
+#' @export
+#' @note predict(MultilayerPerceptronClassificationModel) since 2.1.0
+setMethod("predict", signature(object = "MultilayerPerceptronClassificationModel"),
+          function(object, newData) {
+            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
+          })
+
+# Returns the summary of a Multilayer Perceptron Classification Model produced by \code{spark.mlp}
+
+#' @param object A Multilayer Perceptron Classification Model fitted by \code{spark.mlp}
```

--- End diff --

ok, fixing it now. thanks Felix
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Thank you, @nsyca!
[GitHub] spark issue #13428: [SPARK-12666][CORE] SparkSubmit packages fix for when 'd...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/13428 Thanks for the review @JoshRosen, I made the requested changes and tested it out once more. I think it is low risk because it is pretty well isolated to this particular issue and only improves on how it was before.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Hmm.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Yep. Here is the output.

```scala
scala> val a = Seq((1,2),(2,3)).toDF("a","b").createOrReplaceTempView("A")

scala> val b = Seq((2,5),(3,4)).toDF("a","c").createOrReplaceTempView("B")

scala> sql("select A.A,B.A,A.B,B.C from A full join B on A.A=B.A").show
+----+----+----+----+
|   A|   A|   B|   C|
+----+----+----+----+
|   1|null|   2|null|
|null|   3|null|   4|
|   2|   2|   3|   5|
+----+----+----+----+

scala> sql("select A.A,B.A from A full join B on A.A=B.A where coalesce(A.B,B.C) is not null").show
+---+---+
|  A|  A|
+---+---+
|  2|  2|
+---+---+
```
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74870955 (commenting on the `#' @param newData A SparkDataFrame for testing` line of the same R/pkg/R/mllib.R spark.mlp diff quoted above)

--- End diff --

oops. add @param for object
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74870995 (commenting on the `#' @param object A Multilayer Perceptron Classification Model fitted by \code{spark.mlp}` line of the same R/pkg/R/mllib.R spark.mlp diff quoted above)

--- End diff --

add `@param ... Currently not used` for CRAN check
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14580 @dongjoon-hyun, could you please try this on your PR?

```Scala
val a = Seq((1,2),(2,3)).toDF("a","b").createOrReplaceTempView("A")
val b = Seq((2,5),(3,4)).toDF("a","c").createOrReplaceTempView("B")
sql("select A.A,B.A,A.B,B.C from A full join B on A.A=B.A").show
sql("select A.A,B.A from A full join B on A.A=B.A where coalesce(A.B,B.C) is not null").show
```

How many rows do you get from the last and the second-to-last statements?
[GitHub] spark pull request #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptro...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14447#discussion_r74870693 --- Diff: R/pkg/R/mllib.R --- @@ -414,6 +421,94 @@ setMethod("predict", signature(object = "KMeansModel"), return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) }) +#' Multilayer Perceptron Classification Model +#' +#' \code{spark.mlp} fits a multi-layer perceptron neural network model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' Only categorical data is supported. +#' For more details, see +#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html +#' #multilayer-perceptron-classifier}{Multilayerperceptron classifier}. +#' +#' @param data A \code{SparkDataFrame} of observations and labels for model fitting +#' @param blockSize BlockSize parameter +#' @param layers Integer vector containing the number of nodes for each layer +#' @param solver Solver parameter, supported options: "gd" (minibatch gradient descent) or "l-bfgs" +#' @param maxIter Maximum iteration number +#' @param tol Convergence tolerance of iterations +#' @param stepSize StepSize parameter +#' @param seed Seed parameter for weights initialization +#' @return \code{spark.mlp} returns a fitted Multilayer Perceptron Classification Model +#' @rdname spark.mlp +#' @aliases spark.mlp,SparkDataFrame-method +#' @name spark.mlp +#' @seealso \link{read.ml} +#' @export +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm") +#' +#' # fit a Multilayer Perceptron Classification Model +#' model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", +#'maxIter = 100, tol = 0.5, stepSize = 1, seed = 1) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' 
predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.mlp since 2.1.0 +setMethod("spark.mlp", signature(data = "SparkDataFrame"), + function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100, + tol = 0.5, stepSize = 1, seed = 1, ...) { --- End diff -- We are working on the others in PR #14558; if it's not needed, I think we should remove `...` for now, since the CRAN check will flag it. We can always add parameters later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14626: [SPARK-16519][SPARKR] Handle SparkR RDD generics that cr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14626 **[Test build #63819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63819/consoleFull)** for PR 14626 at commit [`2723eca`](https://github.com/apache/spark/commit/2723ecadcec4baad697639023fba6aafa373f7d6).
[GitHub] spark issue #14641: [Minor] [SparkR] spark.glm weightCol should in the signa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14641 I think the tests are only passing strings, but we should coerce this to be safe. LGTM
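A minimal R sketch of the coercion suggested above. The helper name `coerceWeightCol` is illustrative only, not part of SparkR; the point is simply that `as.character` makes a plain string out of whatever the caller passed:

```r
# Illustrative helper: coerce whatever the caller passed as weightCol to a
# plain character string before handing it to the JVM backend.
coerceWeightCol <- function(weightCol) {
  if (is.null(weightCol)) {
    return(NULL)            # no weight column supplied
  }
  as.character(weightCol)   # a bare symbol or factor becomes a plain string
}

coerceWeightCol("weight")       # already a string, unchanged
coerceWeightCol(quote(weight))  # a bare symbol is coerced to "weight"
```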
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74870049 --- Diff: R/pkg/R/mllib.R --- @@ -632,3 +659,110 @@ setMethod("predict", signature(object = "AFTSurvivalRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) }) + +#' Multivariate Gaussian Mixture Model (GMM) +#' +#' Fits multivariate gaussian mixture model against a Spark DataFrame, similarly to R's +#' mvnormalmixEM(). Users can call \code{summary} to print a summary of the fitted model, +#' \code{predict} to make predictions on new data, and \code{write.ml}/\code{read.ml} +#' to save/load fitted models. +#' +#' @param data SparkDataFrame for training +#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#'Note that the response variable of formula is empty in spark.gaussianMixture. +#' @param k Number of independent Gaussians in the mixture model. 
+#' @param maxIter Maximum iteration number +#' @param tol The convergence tolerance +#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method +#' @return \code{spark.gaussianMixture} returns a fitted multivariate gaussian mixture model +#' @rdname spark.gaussianMixture +#' @name spark.gaussianMixture +#' @seealso mixtools: \url{https://cran.r-project.org/web/packages/mixtools/} +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' library(mvtnorm) +#' set.seed(100) +#' a <- rmvnorm(4, c(0, 0)) +#' b <- rmvnorm(6, c(3, 4)) +#' data <- rbind(a, b) +#' df <- createDataFrame(as.data.frame(data)) +#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2) +#' summary(model) +#' +#' # fitted values on training data +#' fitted <- predict(model, df) +#' head(select(fitted, "V1", "prediction")) +#' +#' # save fitted model to input path +#' path <- "path/to/model" +#' write.ml(model, path) +#' +#' # can also read back the saved model and print +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.gaussianMixture since 2.1.0 +#' @seealso \link{predict}, \link{read.ml}, \link{write.ml} +setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, k = 2, maxIter = 100, tol = 0.01) { +formula <- paste(deparse(formula), collapse = "") +jobj <- callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf, +formula, as.integer(k), as.integer(maxIter), tol) +return(new("GaussianMixtureModel", jobj = jobj)) + }) + +# Get the summary of a multivariate gaussian mixture model + +#' @param object A fitted gaussian mixture model +#' @return \code{summary} returns the model's lambda, mu, sigma and posterior +#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method +#' @rdname spark.gaussianMixture +#' @export +#' @note summary(GaussianMixtureModel) since 2.1.0 +setMethod("summary", signature(object = "GaussianMixtureModel"), + function(object, ...) 
{ +jobj <- object@jobj +is.loaded <- callJMethod(jobj, "isLoaded") +lambda <- unlist(callJMethod(jobj, "lambda")) +muList <- callJMethod(jobj, "mu") +sigmaList <- callJMethod(jobj, "sigma") +k <- callJMethod(jobj, "k") +dim <- callJMethod(jobj, "dim") +mu <- c() +for (i in 1 : k) { + start <- (i - 1) * dim + 1 + end <- i * dim + mu[[i]] <- unlist(muList[start : end]) +} +sigma <- c() +for (i in 1 : k) { + start <- (i - 1) * dim * dim + 1 + end <- i * dim * dim + sigma[[i]] <- t(matrix(sigmaList[start : end], ncol = dim)) +} +posterior <- if (is.loaded) { + NULL +} else { + dataFrame(callJMethod(jobj, "posterior")) +} +return(list(lambda = lambda, mu = mu, sigma = sigma, + posterior = posterior, is.loaded = is.loaded)) + }) + +# Predicted values based on a gaussian mixture model + +#' @param newData SparkDataFrame for testing +#' @return \code{predict} returns a SparkDataFrame containing predicted labels in a column named +#' "prediction" +#' @return \code{predict} returns the predicted values
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74869983 --- Diff: R/pkg/R/mllib.R --- @@ -632,3 +659,110 @@ setMethod("predict", signature(object = "AFTSurvivalRegressionModel"), function(object, newData) { return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf))) }) + +#' Multivariate Gaussian Mixture Model (GMM) +#' +#' Fits multivariate gaussian mixture model against a Spark DataFrame, similarly to R's +#' mvnormalmixEM(). Users can call \code{summary} to print a summary of the fitted model, +#' \code{predict} to make predictions on new data, and \code{write.ml}/\code{read.ml} +#' to save/load fitted models. +#' +#' @param data SparkDataFrame for training +#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#'Note that the response variable of formula is empty in spark.gaussianMixture. +#' @param k Number of independent Gaussians in the mixture model. 
+#' @param maxIter Maximum iteration number +#' @param tol The convergence tolerance +#' @aliases spark.gaussianMixture,SparkDataFrame,formula-method +#' @return \code{spark.gaussianMixture} returns a fitted multivariate gaussian mixture model +#' @rdname spark.gaussianMixture +#' @name spark.gaussianMixture +#' @seealso mixtools: \url{https://cran.r-project.org/web/packages/mixtools/} +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' library(mvtnorm) +#' set.seed(100) +#' a <- rmvnorm(4, c(0, 0)) +#' b <- rmvnorm(6, c(3, 4)) +#' data <- rbind(a, b) +#' df <- createDataFrame(as.data.frame(data)) +#' model <- spark.gaussianMixture(df, ~ V1 + V2, k = 2) +#' summary(model) +#' +#' # fitted values on training data +#' fitted <- predict(model, df) +#' head(select(fitted, "V1", "prediction")) +#' +#' # save fitted model to input path +#' path <- "path/to/model" +#' write.ml(model, path) +#' +#' # can also read back the saved model and print +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.gaussianMixture since 2.1.0 +#' @seealso \link{predict}, \link{read.ml}, \link{write.ml} +setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, k = 2, maxIter = 100, tol = 0.01) { + formula <- paste(deparse(formula), collapse = "") + jobj <- callJStatic("org.apache.spark.ml.r.GaussianMixtureWrapper", "fit", data@sdf, + formula, as.integer(k), as.integer(maxIter), tol) --- End diff -- add `as.numeric(tol)` if we could, since tol is not in the signature
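A quick R illustration of why the `as.numeric(tol)` coercion matters: since `tol` is not constrained by the S4 signature, nothing guarantees it is a double before the JVM call. The values below are for demonstration only:

```r
# An integer literal passes is.numeric() but is stored as an R integer,
# which can surface on the JVM side as an Integer rather than a Double.
tol <- 1L
typeof(tol)              # "integer"
typeof(as.numeric(tol))  # "double" -- safe to pass where a Double is expected
```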
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74869872 --- Diff: R/pkg/R/mllib.R --- @@ -526,6 +533,24 @@ setMethod("write.ml", signature(object = "KMeansModel", path = "character"), invisible(callJMethod(writer, "save", path)) }) +# Save fitted MLlib model to the input path + +#' @param path The directory where the model is saved +#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE +#' which means throw exception if the output path exists. +#' +#' @rdname spark.gaussianMixture --- End diff -- let's add `@aliases`
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74869843 --- Diff: R/pkg/R/generics.R --- @@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s #' @export setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") }) +#' @rdname spark.gaussianMixture +#' @export +setGeneric("spark.gaussianMixture", + function(data, formula, ...) { + standardGeneric("spark.gaussianMixture") --- End diff -- does it fit on one line, like the others?
[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14229#discussion_r74869802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala --- @@ -0,0 +1,207 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.r + +import scala.collection.mutable + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.SparkException +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.ml.clustering.{LDA, LDAModel} +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover} +import org.apache.spark.ml.linalg.VectorUDT +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StringType + + +private[r] class LDAWrapper private ( + val pipeline: PipelineModel, + val logLikelihood: Double, + val logPerplexity: Double, + val vocabulary: Array[String]) extends MLWritable { + + import LDAWrapper._ + + private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel] + private val preprocessor: PipelineModel = + new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1)) + + def transform(data: Dataset[_]): DataFrame = { + pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL) + } + + def computeLogPerplexity(data: Dataset[_]): Double = { + lda.logPerplexity(preprocessor.transform(data)) + } + + lazy val topicIndices: DataFrame = lda.describeTopics(10) --- End diff -- I think we could add an additional parameter if it could be useful.
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14506 **[Test build #63818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63818/consoleFull)** for PR 14506 at commit [`3042af2`](https://github.com/apache/spark/commit/3042af2f0e9ae82e40d14e950a1036b9e417dbc9).
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13758 You can take a look at `GenerateUnsafeProjection`: if the `ArrayData` is already an unsafe array, we copy it directly, so no iteration is needed.
[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14229#discussion_r74869681 --- Diff: R/pkg/R/mllib.R --- @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula return(new("AFTSurvivalRegressionModel", jobj = jobj)) }) +#' Latent Dirichlet Allocation +#' +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data A SparkDataFrame for training +#' @param features Features column name, default "features". Either Vector format column or String --- End diff -- could you link to libSVM's ml.Vector-format?
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/8880 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63817/ Test FAILed.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/8880 **[Test build #63817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63817/consoleFull)** for PR 8880 at commit [`f5af081`](https://github.com/apache/spark/commit/f5af08147ffcfcd883b758219b40baf0eb2e4e16). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/8880 Merged build finished. Test FAILed.
[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14229#discussion_r74869578 --- Diff: R/pkg/R/mllib.R --- @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula return(new("AFTSurvivalRegressionModel", jobj = jobj)) }) +#' Latent Dirichlet Allocation +#' +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data A SparkDataFrame for training +#' @param features Features column name, default "features". Either Vector format column or String +#'format column are accepted. +#' @param k Number of topics, default 10 +#' @param maxIter Maximum iterations, default 20 +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online" +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in +# each iteration of mini-batch gradient descent, in range (0, 1], default 0.05 +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for +#'the prior placed on topic distributions over terms, default -1 to set automatically on the +#'Spark side. Use \code{summary} to retrieve the effective topicConcentration. +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the +#'prior placed on documents distributions over topics (\code{theta}), default -1 to set +#'automatically on the Spark side. Use \code{summary} to retrieve the effective +#'docConcentration. +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected +#'given training data with string format column. 
--- End diff -- right, I think "affects" is the right word to use