[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/4047

[SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM

**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**

The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it. In particular, see the API and Planning for the Future sections.

## Goals

* Settle on a public API which may eventually include:
  * more inference algorithms
  * more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there. However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently. It will be important to check and improve accuracy. For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.

## Sketch of contents of this PR

**Dependency: This makes MLlib depend on GraphX.**

Files and classes:
* LDA.scala (441 lines):
  * class LDA (main estimator class)
  * LDA.Document (text + document ID)
* LDAModel.scala (266 lines):
  * abstract class LDAModel
  * class LocalLDAModel
  * class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)

Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh. On Smoothing and Inference for Topic Models. UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the "DEVELOPERS NOTE" in LDA.scala

## Design notes

Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)

Here, I list the main changes AFTER the design doc was posted.

Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters. This is different from the likelihood of the data given the hyperparameters, which would be harder to compute. I'd describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors. I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR. I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms. See Asuncion et al. (2009) in the design doc for a good demonstration. I encourage good behavior via defaults and warning messages.

Items planned for future PRs:
* perplexity
* API taking Strings

## Questions for reviewers

* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
  * Pro: We may someday want LinearDiscriminantAnalysis.
  * Con: Very long names
* Should LDA reside in clustering? Or do we want a sub-package?
  * mllib.topicmodel
  * mllib.clustering.topicmodel
* Does the API seem reasonable and extensible?
* Unit tests:
  * Should there be a test which checks clustering results? E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters. Does that sound useful or too flaky?

## Other notes

This has not been tested much for scaling.
I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics. Running it for 500 iterations made it fail because of GC problems. Future PRs will need to improve the scaling.

## Thanks to…

* @dlwh for the initial implementation
* + @jegonzal for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: @akopich @witgo @yinxusen @dlwh @EntilZha @jegonzal @IlyaKozlov

CC: @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark davidhall-lda

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4047.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes
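For context on the algorithm referenced above: the smoothed EM formulation of Asuncion et al. (2009) repeatedly computes per-token topic responsibilities from the current expected count tables. A minimal pure-Scala sketch of that E-step (illustrative variable names, NOT the PR's GraphX implementation):

```scala
// E-step responsibility for one (term w, document j) pair under smoothed
// EM (Asuncion et al. 2009): gamma_z ∝ (N_wz + eta) * (N_zj + alpha) / (N_z + V * eta).
// This is an illustrative sketch, not the GraphX code in the PR.
def responsibilities(
    termTopicCounts: Array[Double],  // N_wz for this term, one entry per topic
    docTopicCounts: Array[Double],   // N_zj for this doc, one entry per topic
    topicTotals: Array[Double],      // N_z, total expected count per topic
    eta: Double,                     // topic-term smoothing hyperparameter
    alpha: Double,                   // doc-topic smoothing hyperparameter
    vocabSize: Int): Array[Double] = {
  val k = termTopicCounts.length
  val unnorm = Array.tabulate(k) { z =>
    (termTopicCounts(z) + eta) * (docTopicCounts(z) + alpha) /
      (topicTotals(z) + vocabSize * eta)
  }
  val total = unnorm.sum
  unnorm.map(_ / total)  // normalize so responsibilities sum to 1
}
```

In the PR's representation, the count tables live on GraphX term and document vertices, and these responsibilities drive the M-step count updates.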
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-69990018

oh good point Marcelo -- I forgot to add that I've only done this for `core` in this PR. I wanted to ask others whether it's worthwhile to do in other projects or not before I go digging into each one of them.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3833#issuecomment-69991087

[Test build #25565 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25565/consoleFull) for PR 3833 at commit [`7ac4dfc`](https://github.com/apache/spark/commit/7ac4dfc4a41b20c97c29fdf60045aca64fe08a6f).

* This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4047#discussion_r22963462

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala ---

@@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scala.collection.mutable.ArrayBuffer
+
+import java.text.BreakIterator
+
+import scopt.OptionParser
+
+import org.apache.log4j.{Level, Logger}
+
+import org.apache.spark.{SparkContext, SparkConf}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.clustering.LDA
+import org.apache.spark.mllib.clustering.LDA.Document
+import org.apache.spark.mllib.linalg.SparseVector
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * An example Latent Dirichlet Allocation (LDA) app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseKMeans [options] input

--- End diff --

Thanks!
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3833#discussion_r22967437

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---

@@ -61,20 +67,70 @@ class LogisticRegressionModel (
   override protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector,
       intercept: Double) = {
-    val margin = weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
-    val score = 1.0 / (1.0 + math.exp(-margin))
-    threshold match {
-      case Some(t) => if (score > t) 1.0 else 0.0
-      case None => score
+    // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
+    if (dataMatrix.size == weightMatrix.size) {
+      val margin = dot(weights, dataMatrix) + intercept
+      val score = 1.0 / (1.0 + math.exp(-margin))
+      threshold match {
+        case Some(t) => if (score > t) 1.0 else 0.0
+        case None => score
+      }
+    } else {
+      val dataWithBiasSize = weightMatrix.size / (nClasses - 1)
+      val dataWithBias = if (dataWithBiasSize == dataMatrix.size) {
+        dataMatrix
+      } else {
+        assert(dataMatrix.size + 1 == dataWithBiasSize)
+        MLUtils.appendBias(dataMatrix)
+      }
+
+      val margins = Array.ofDim[Double](nClasses)
+
+      val weightsArray = weights match {
+        case dv: DenseVector => dv.values
+        case _ =>
+          throw new IllegalArgumentException(
+            s"weights only supports dense vector but got type ${weights.getClass}.")
+      }
+
+      var i = 0
+      while (i < nClasses - 1) {

--- End diff --

There is `margins(i + 1) = margin`, and the first margins(0) == 0, so using `(0 until nClasses).map` would require a couple more if statements. I changed it to a for loop since it's not a tight loop.
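The control flow under discussion (weights flattened into per-class blocks with a bias term, `margins(0)` pinned to 0 for the reference class, and the loop filling `margins(i + 1)`) can be sketched in plain Scala, independent of the MLlib types. Hypothetical helper, not the PR's code:

```scala
// Multinomial logistic prediction sketch: weights are flattened as
// (nClasses - 1) blocks of (numFeatures + 1) values, the last value of
// each block being that class's intercept. margins(0) stays 0.0 for the
// reference class; the prediction is the argmax over margins.
def predictClass(weights: Array[Double], features: Array[Double], nClasses: Int): Int = {
  val blockSize = features.length + 1
  require(weights.length == (nClasses - 1) * blockSize)
  val margins = new Array[Double](nClasses)  // margins(0) == 0.0
  var i = 0
  while (i < nClasses - 1) {
    val offset = i * blockSize
    var m = weights(offset + features.length)  // intercept for class i + 1
    var j = 0
    while (j < features.length) {
      m += weights(offset + j) * features(j)
      j += 1
    }
    margins(i + 1) = m
    i += 1
  }
  margins.indices.maxBy(i => margins(i))
}
```

With nClasses == 2 this reduces to comparing a single margin against 0, i.e. the binary case.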
[GitHub] spark pull request: [SPARK-4803] [streaming] Remove duplicate Regi...
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3648#issuecomment-69973099

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25554/ Test PASSed.
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user tnachen commented on the pull request:

https://github.com/apache/spark/pull/3861#issuecomment-69976532

@andrewor14 I wonder if you have time to review this?
[GitHub] spark pull request: [SPARK-5231][WebUI] History Server shows wrong...
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4029#discussion_r22963720

--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---

@@ -469,6 +471,7 @@ private[spark] object JsonProtocol {
   def jobStartFromJson(json: JValue): SparkListenerJobStart = {
     val jobId = (json \ "Job ID").extract[Int]
+    val submissionTime = (json \ "Submission Time").extractOpt[Long]

--- End diff --

Similarly, you should also add a backwards-compatibility test; this can be a few lines in the existing SparkListenerJobStart backward compatibility test: https://github.com/sarutak/spark/blob/SPARK-5231/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala#L240
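The compatibility pattern being reviewed (read a field that older event logs may lack with `extractOpt`, defaulting when absent) looks roughly like this, sketched here against a plain Map instead of json4s so it stands alone (hypothetical names and sentinel):

```scala
// Backward-compatible field extraction sketch: new writers emit
// "Submission Time", while pre-SPARK-5231 event logs omit it. Reading
// into an Option and defaulting keeps the old logs parseable.
case class JobStart(jobId: Int, submissionTime: Long)

def jobStartFromJson(json: Map[String, Any]): JobStart = {
  val jobId = json("Job ID").asInstanceOf[Int]
  // extractOpt-style read: None when the key is missing from an old log
  val submissionTime = json.get("Submission Time").map(_.asInstanceOf[Long])
  JobStart(jobId, submissionTime.getOrElse(-1L))  // -1L marks "unknown"
}
```

A backward-compatibility test then simply feeds the parser a map without the new key and asserts on the default.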
[GitHub] spark pull request: [SPARK-5249] Added type specific set functions...
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4042#issuecomment-69979109

If you run the MiMa checks, I'm pretty sure that this will break binary compatibility because it changes the signature of a public method. Let's see, though: Jenkins, this is ok to test.
[GitHub] spark pull request: [SPARK-4803] [streaming] Remove duplicate Regi...
Github user ilayaperumalg commented on the pull request:

https://github.com/apache/spark/pull/3648#issuecomment-69980475

The test passed now (after increasing the timeout value). Can someone re-run the test to see if the test result is consistent?
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-69981443

[Test build #25561 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25561/consoleFull) for PR 3916 at commit [`61919df`](https://github.com/apache/spark/commit/61919df21853eba479ddb591fb89dcecfd341988).

* This patch merges cleanly.
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/4050

SPARK-5199. Input metrics should show up for InputFormats that return CombineFileSplits

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-5199

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4050.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4050

commit 9962dd097425442d62778f72911c6320c812f153
Author: Sandy Ryza sa...@cloudera.com
Date: 2015-01-14T21:17:02Z

    SPARK-5199. Input metrics should show up for InputFormats that return CombineFileSplits
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3833#issuecomment-69993842

[Test build #25566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25566/consoleFull) for PR 3833 at commit [`4e16781`](https://github.com/apache/spark/commit/4e1678160f135f263b242b4cf1c28c95886bc11b).

* This patch **fails to build**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3833#issuecomment-69993846

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25566/ Test FAILed.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3833#discussion_r22963566

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---

@@ -18,30 +18,36 @@ package org.apache.spark.mllib.classification

 import org.apache.spark.annotation.Experimental
-import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.linalg.BLAS.dot
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.DataValidators
+import org.apache.spark.mllib.util.{DataValidators, MLUtils}
 import org.apache.spark.rdd.RDD

 /**
- * Classification model trained using Logistic Regression.
+ * Classification model trained using Multinomial/Binary Logistic Regression.
  *
  * @param weights Weights computed for every feature.
- * @param intercept Intercept computed for this model.
+ * @param intercept Intercept computed for this model. (Only used in Binary Logistic Regression.
+ *                  In Multinomial Logistic Regression, the intercepts will not be a single values,
+ *                  so the intercepts will be part of the weights.)
+ * @param nClasses The number of possible outcomes for Multinomial Logistic Regression.
+ *                 The default value is 2 which is Binary Logistic Regression.
  */
 class LogisticRegressionModel (
     override val weights: Vector,
-    override val intercept: Double)
+    override val intercept: Double,
+    nClasses: Int = 2)
   extends GeneralizedLinearModel(weights, intercept) with ClassificationModel with Serializable {

   private var threshold: Option[Double] = Some(0.5)

   /**
    * :: Experimental ::
-   * Sets the threshold that separates positive predictions from negative predictions. An example
-   * with prediction score greater than or equal to this threshold is identified as an positive,
-   * and negative otherwise. The default value is 0.5.
+   * Sets the threshold that separates positive predictions from negative predictions
+   * in Binary Logistic Regression. An example with prediction score greater than or equal to
+   * this threshold is identified as an positive, and negative otherwise. The default value is 0.5.
    */

--- End diff --

I think the model should have an API to predict probabilities, and we should have another transformer to apply the threshold so we can reuse the logic for all the probabilistic models. I'd like to remove the threshold stuff from LOR entirely. @mengxr what do you think?
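The separation dbtsai proposes above (the model emits a probability; a reusable, model-agnostic step applies the threshold) might look like this in outline. Hypothetical helper functions, not the MLlib API:

```scala
// Sketch of splitting probability prediction from thresholding, so the
// thresholding logic can be reused across probabilistic classifiers.

// The model's job: map a margin to a probability (binary logistic case).
def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

// The reusable transformer's job: probability >= t maps to the positive
// label, otherwise the negative label.
def applyThreshold(score: Double, t: Double): Double =
  if (score >= t) 1.0 else 0.0
```

Any model producing a calibrated score could then share `applyThreshold`, instead of each classifier carrying its own threshold field.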
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-69981602

Hi @andrewor14, thanks for all the comments. I updated the patch and I think I covered everything. Mainly, I fixed the issue with new lines that you asked about, and also now `sbt assembly` should work without having to do `sbt package` too.

> I notice that SparkLauncher doesn't have the full set of options found in SparkSubmitArguments

My thinking is that aside from the exposed APIs, everything else would be set using `SparkLauncher.setConf()`. I even thought about removing some other methods (like `setMaster()`) but decided to leave the most common ones easily accessible.
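The escape-hatch design described above (a few common named setters plus a generic `setConf()` for everything else) can be sketched with a tiny fluent builder. This is a hypothetical stand-in, not the actual SparkLauncher API:

```scala
// Minimal builder sketch: expose the most common options as named
// setters and route everything else through a generic conf map.
class LauncherSketch {
  private val conf = scala.collection.mutable.Map.empty[String, String]

  // Common option kept as a convenience setter.
  def setMaster(master: String): LauncherSketch = setConf("spark.master", master)

  // Generic escape hatch for every other configuration key.
  def setConf(key: String, value: String): LauncherSketch = {
    conf(key) = value
    this  // return this so calls chain fluently
  }

  def build(): Map[String, String] = conf.toMap
}
```

The trade-off is discoverability (named setters) versus API surface (everything funneled through one method).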
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3833#discussion_r22967246

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---

@@ -61,20 +67,70 @@ class LogisticRegressionModel (
   override protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector,

--- End diff --

The argument to the gradient calculation is properly a vector of weights, so that need not change for API reasons. So is it just having to do the translation? It's a line of code I think, although it requires a copy. Maybe someone else can weigh in with an opinion too.
[GitHub] spark pull request: SPARK-4687. [WIP] Add an addDirectory API
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3670#issuecomment-69985480

[Test build #25562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25562/consoleFull) for PR 3670 at commit [`8413c50`](https://github.com/apache/spark/commit/8413c5010527f51cb8fc6401201a0d5f1f8ef6e9).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5193][SQL] Tighten up SQLContext API
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4049#issuecomment-69990207

[Test build #25564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25564/consoleFull) for PR 4049 at commit [`4a38c9b`](https://github.com/apache/spark/commit/4a38c9b15ecc04f2ae2f285af5742608fc91549b).

* This patch merges cleanly.
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4050#issuecomment-69996188

This mostly LGTM. My only concern is with the proliferation of copy pasta between the HadoopRDD and NewHadoopRDD.
[GitHub] spark pull request: [SPARK-5228][WebUI] Hide tables for Active Jo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4028
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user dlwh commented on a diff in the pull request:

https://github.com/apache/spark/pull/4047#discussion_r22962641

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala ---

@@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scala.collection.mutable.ArrayBuffer
+
+import java.text.BreakIterator
+
+import scopt.OptionParser
+
+import org.apache.log4j.{Level, Logger}
+
+import org.apache.spark.{SparkContext, SparkConf}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.clustering.LDA
+import org.apache.spark.mllib.clustering.LDA.Document
+import org.apache.spark.mllib.linalg.SparseVector
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * An example Latent Dirichlet Allocation (LDA) app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseKMeans [options] input

--- End diff --

(rename)
[GitHub] spark pull request: [SPARK-4014] Add TaskContext.attemptNumber and...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3849
[GitHub] spark pull request: [SPARK-5234][ml]examples for ml don't have spa...
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4044#issuecomment-69980197

Merged into master. Thanks!
[GitHub] spark pull request: [SPARK-5234][ml]examples for ml don't have spa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4044
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4047#issuecomment-69989172

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25560/ Test PASSed.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69990485 IIRC there are a few tests under `sql/` that use local-cluster too, but I can't name any off the top of my head.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69991447 Just curious - what is the before and after time? I.e. what fraction of time does this cut down on?
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-69992822 [Test build #25566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25566/consoleFull) for PR 3833 at commit [`4e16781`](https://github.com/apache/spark/commit/4e1678160f135f263b242b4cf1c28c95886bc11b). * This patch **does not merge cleanly**.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69997514 This is not a terribly useful observation, but this is what `surefire` vs `failsafe` is for in the Maven world, without making a custom mechanism. But we have the SBT build too.
[GitHub] spark pull request: [SPARK-4803] [streaming] Remove duplicate Regi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3648#issuecomment-69973089 [Test build #25554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25554/consoleFull) for PR 3648 at commit [`868efab`](https://github.com/apache/spark/commit/868efabd2c43a662b8ccfb1651192dfb95f80f06). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5231][WebUI] History Server shows wrong...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4029#issuecomment-69976817 This is a nice patch, but I wonder whether there's a smaller fix that doesn't require changing SparkListener events; that will make it easier to backport that patch to `branch-1.2`. The job page already knows the last stage in the job (the result stage), so I think we might be able to use the final stage's completion time as the job completion time and the first stage's submission time as the job start time. However, there are a couple of corner-cases that this might miss: I could submit a job that spends a bunch of time queued behind other jobs before its first stage starts running, in which case it would be helpful to be able to distinguish between scheduler delays and stage durations. Similarly, there might be a corner-case related to the job completion time if we have a job that spends a lot of time fetching results back to the driver after they've been stored in the block manager by completed tasks. So, I guess the approach here seems like the right fix. I'd guess we might be able to do a separate fix in branch-1.2 to use the first/last stage time approximations. I have a couple of comments on the code here, so I'll comment on those inline.
[GitHub] spark pull request: [SPARK-5095] Support capping cores and launch ...
Github user tnachen commented on a diff in the pull request: https://github.com/apache/spark/pull/4027#discussion_r22963154 --- Diff: docs/running-on-mesos.md --- @@ -226,6 +226,20 @@ See the [configuration page](configuration.html) for information on Spark config The final total amount of memory allocated is the maximum value between executor memory plus memoryOverhead, and overhead fraction (1.07) plus the executor memory. `</td></tr>` `+<tr>` `+ <td><code>spark.mesos.coarse.cpu.max</code></td>` --- End diff -- Good catch! Thanks
[GitHub] spark pull request: [Minor] Fix tiny typo in BlockManager
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4046#issuecomment-69978070 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25557/ Test PASSed.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-69978003 [Test build #25560 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25560/consoleFull) for PR 4047 at commit [`984c414`](https://github.com/apache/spark/commit/984c414ce2bfc14fc1bef35adfca78db4770ff37). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5235] Make SQLConf Serializable
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4031
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-69979680 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25559/ Test FAILed.
[GitHub] spark pull request: [SPARK-5193][SQL] Tighten up SQLContext API
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/4049 [SPARK-5193][SQL] Tighten up SQLContext API 1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD) 2. Moved extraStrategies into ExperimentalMethods. 3. Made private methods protected[sql] so they don't show up in javadocs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark sqlContext-refactor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4049.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4049 commit 4a38c9b15ecc04f2ae2f285af5742608fc91549b Author: Reynold Xin r...@databricks.com Date: 2015-01-14T20:47:49Z [SPARK-5193][SQL] Tighten up SQLContext API 1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD) 2. Moved extraStrategies into ExperimentalMethods. 3. Made private methods protected[sql] so they don't show up in javadocs.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-69992401 [Test build #25565 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25565/consoleFull) for PR 3833 at commit [`7ac4dfc`](https://github.com/apache/spark/commit/7ac4dfc4a41b20c97c29fdf60045aca64fe08a6f). * This patch **fails to build**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-69995071 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25561/ Test FAILed.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-69995062 [Test build #25561 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25561/consoleFull) for PR 3916 at commit [`61919df`](https://github.com/apache/spark/commit/61919df21853eba479ddb591fb89dcecfd341988). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/4050#discussion_r22972019 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -219,6 +220,9 @@ class HadoopRDD[K, V]( val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) { SparkHadoopUtil.get.getFSBytesReadOnThreadCallback( split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf) + } else if (split.inputSplit.value.isInstanceOf[CombineFileSplit]) { +SparkHadoopUtil.get.getFSBytesReadOnThreadCallback( + split.inputSplit.value.asInstanceOf[CombineFileSplit].getPath(0), jobConf) --- End diff -- The issue is that those are actually two different classes. There's a CombineFileSplit for the old MR API (used by HadoopRDD) and a CombineFileSplit for the new one (used by NewHadoopRDD).
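The two-classes point above is the crux: the old (`org.apache.hadoop.mapred`) and new (`org.apache.hadoop.mapreduce`) APIs each define their own, unrelated `CombineFileSplit`, so a single `isInstanceOf` check cannot cover both RDD types. A stand-in sketch of that dispatch (the stub classes here are illustrative, not Hadoop's real types):

```scala
// Stand-ins for the two same-named but unrelated Hadoop split classes.
class OldCombineFileSplit { def getPath(i: Int): String = s"old-path-$i" }
class NewCombineFileSplit { def getPath(i: Int): String = s"new-path-$i" }

// Each RDD flavor must match its own class; there is no common supertype
// carrying getPath, so the check is duplicated per API generation.
def firstPath(split: Any): Option[String] = split match {
  case s: OldCombineFileSplit => Some(s.getPath(0)) // HadoopRDD branch
  case s: NewCombineFileSplit => Some(s.getPath(0)) // NewHadoopRDD branch
  case _                      => None               // not a combine split
}

println(firstPath(new OldCombineFileSplit)) // Some(old-path-0)
println(firstPath(new NewCombineFileSplit)) // Some(new-path-0)
```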
[GitHub] spark pull request: [SPARK-2909] [MLlib] [PySpark] SparseVector in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4025
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-69971352 [Test build #25558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25558/consoleFull) for PR 4047 at commit [`c6e4308`](https://github.com/apache/spark/commit/c6e430867ca32ca6f409f953a2d47dd04a1e6e53). * This patch **fails Scala style tests**. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * ` case class Document(counts: Vector, id: Long)`
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-69971357 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25558/ Test FAILed.
[GitHub] spark pull request: [SPARK-2909] [MLlib] [PySpark] SparseVector in...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4025#issuecomment-69971384 Merged into master. Thanks!
[GitHub] spark pull request: [SPARK-5231][WebUI] History Server shows wrong...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4029#discussion_r22963608 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -469,6 +471,7 @@ private[spark] object JsonProtocol { def jobStartFromJson(json: JValue): SparkListenerJobStart = { val jobId = (json \ "Job ID").extract[Int] +val submissionTime = (json \ "Submission Time").extractOpt[Long] --- End diff -- For backwards-compatibility, you should use `Utils.jsonOption` here; see the block comment at https://github.com/sarutak/spark/blob/SPARK-5231/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L46, plus the examples elsewhere in this file.
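The backwards-compatibility concern above boils down to extracting a field that older event logs never wrote. A minimal sketch of the optional-extraction pattern, using a plain `Map` to stand in for the parsed JSON so it runs without json4s (with json4s the equivalent is `(json \ "Submission Time").extractOpt[Long]`; the field names are illustrative):

```scala
// Tolerate a field that older event logs lack: return None instead of
// throwing when "Submission Time" is absent.
def submissionTime(event: Map[String, Any]): Option[Long] =
  event.get("Submission Time").collect { case t: Long => t }

val newLog = Map[String, Any]("Job ID" -> 1, "Submission Time" -> 1421267000000L)
val oldLog = Map[String, Any]("Job ID" -> 1) // written before the field existed

println(submissionTime(newLog)) // Some(1421267000000)
println(submissionTime(oldLog)) // None
```

The same shape lets a history server replay logs produced by several Spark versions without per-version parsers.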
[GitHub] spark pull request: [SPARK-4014] Add TaskContext.attemptNumber and...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3849#issuecomment-69978436 I'm going to merge this into `master` (1.3.0) since it's a blocker for some tests that I want to write. I'll look into backporting this into maintenance branches, too, since that would allow me to backport regression tests that use the new methods.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/4048 SPARK-4746 make it easy to skip IntegrationTests * create an `IntegrationTest` tag * label all tests in core as an `IntegrationTest` if they use a `local-cluster` * make a `unit-test` task in sbt so it's easy to run just the unit tests in local development. On my laptop, this means that I can run `~unit-test` in my sbt console, which takes ~5 mins on the first run. But since it is calling `test-quick` under the hood, as I make changes it only re-runs the tests I've affected, so generally I get updated results on all tests in a second or two. Of course this means it's skipping a bunch of important tests, but hopefully this is a useful subset of tests that can actually be run locally. If you don't skip the IntegrationTests, it's totally impractical to ever get through even the first run of `test-quick` on my laptop. An added bonus is that this set of tests can be run without ever having to do the `mvn package` step, since we never launch a full cluster as an external process.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/squito/spark SPARK-4746 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4048.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4048 commit 030cc0c0cb57d35a043e68509a3997fb3f1a3dc1 Author: Imran Rashid iras...@cloudera.com Date: 2015-01-13T21:45:38Z add IntegrationTest tag, and label a bunch of tests in core commit 30f4d636387e57e9c104024db5a20afcde1b7cbb Author: Imran Rashid iras...@cloudera.com Date: 2015-01-14T19:36:37Z add a unit-test task commit 3a8503227d53554155e5766ce12d48039854f163 Author: Imran Rashid iras...@cloudera.com Date: 2015-01-14T20:41:07Z fix task name
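For reference, the tagging mechanism the PR describes can be sketched with ScalaTest's `Tag`; the names below are assumptions for illustration, not the PR's actual code, and the snippet needs ScalaTest on the classpath rather than running standalone:

```scala
// Hypothetical sketch of the ScalaTest tagging approach (not the PR's code).
import org.scalatest.{FunSuite, Tag}

// A tag object that suites attach to slow, cluster-launching tests.
object IntegrationTest extends Tag("org.apache.spark.IntegrationTest")

class ReplicationSuite extends FunSuite {
  // Tagged: excluded when the runner is told to ignore this tag.
  test("replication across a local-cluster", IntegrationTest) {
    // ...would launch a local-cluster here...
  }

  // Untagged: runs in the fast unit-test subset.
  test("block manager bookkeeping") {
    assert(1 + 1 === 2)
  }
}
```

A `unit-test` sbt task would then exclude the tag via a runner argument such as `Tests.Argument("-l", "org.apache.spark.IntegrationTest")`.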
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-69989155 [Test build #25560 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25560/consoleFull) for PR 4047 at commit [`984c414`](https://github.com/apache/spark/commit/984c414ce2bfc14fc1bef35adfca78db4770ff37). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4687. [WIP] Add an addDirectory API
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3670#issuecomment-69993972 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25562/ Test FAILed.
[GitHub] spark pull request: SPARK-4687. [WIP] Add an addDirectory API
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3670#issuecomment-69993963 [Test build #25562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25562/consoleFull) for PR 3670 at commit [`8413c50`](https://github.com/apache/spark/commit/8413c5010527f51cb8fc6401201a0d5f1f8ef6e9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4585. Spark dynamic executor allocation ...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/4051 SPARK-4585. Spark dynamic executor allocation should use minExecutors as initial number You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark-4585 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4051.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4051 commit 9d7b6f98caff5de3db88d372853cccf012f36dc6 Author: Sandy Ryza sa...@cloudera.com Date: 2014-12-28T03:29:11Z SPARK-4585. Spark dynamic executor allocation should use minExecutors as initial number
[GitHub] spark pull request: SPARK-4585. Spark dynamic executor allocation ...
Github user ksakellis commented on a diff in the pull request: https://github.com/apache/spark/pull/4051#discussion_r22972319 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -73,12 +73,12 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) .orNull // If dynamic allocation is enabled, start at the max number of executors --- End diff -- Fix
[GitHub] spark pull request: [SPARK-5235] Make SQLConf Serializable
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4031#issuecomment-69969756 [Test build #2 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/consoleFull) for PR 4031 at commit [`c2103f5`](https://github.com/apache/spark/commit/c2103f57720627f44fe8ad8dcd1af8d9e2fc31f2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5235] Make SQLConf Serializable
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4031#issuecomment-69969765 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/2/ Test PASSed.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-69971147 [Test build #25558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25558/consoleFull) for PR 4047 at commit [`c6e4308`](https://github.com/apache/spark/commit/c6e430867ca32ca6f409f953a2d47dd04a1e6e53). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-2909] [MLlib] [PySpark] SparseVector in...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4025#issuecomment-69971132 LGTM. @MechCoder The Scala code uses Breeze's index lookup, which uses bisection as well. You can try implementing bisection in MLlib and then doing a micro-benchmark. If there is a big difference, we can move the implementation into MLlib.
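The bisection lookup under discussion can be hand-rolled for such a micro-benchmark in a few lines of plain Scala. A minimal sketch (parallel `indices`/`values` arrays as a sparse vector stores them; the function name is illustrative):

```scala
// Element lookup in a sparse vector via bisection (binary search) over the
// sorted `indices` array. Indices not stored are implicit zeros.
def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
  var lo = 0
  var hi = indices.length - 1
  while (lo <= hi) {
    val mid = (lo + hi) >>> 1            // unsigned shift avoids overflow
    if (indices(mid) == i) return values(mid)
    else if (indices(mid) < i) lo = mid + 1
    else hi = mid - 1
  }
  0.0                                    // index not stored => zero entry
}

// Stored entries: positions 1, 4, 7 of a larger vector.
val indices = Array(1, 4, 7)
val values  = Array(2.0, 3.0, 5.0)
println(sparseApply(indices, values, 4)) // 3.0
println(sparseApply(indices, values, 5)) // 0.0
```

Each lookup is O(log nnz), which is what makes the Breeze-vs-local comparison a question of constant factors rather than asymptotics.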
[GitHub] spark pull request: [SPARK-5228][WebUI] Hide tables for Active Jo...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4028#issuecomment-69972337 This matches the approach that I used for the job details page, so this looks good to me. I'm going to merge this into `master` (1.3.0). Thanks!
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3833#discussion_r22963904

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -18,30 +18,36 @@ package org.apache.spark.mllib.classification

 import org.apache.spark.annotation.Experimental
-import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.linalg.BLAS.dot
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.DataValidators
+import org.apache.spark.mllib.util.{DataValidators, MLUtils}
 import org.apache.spark.rdd.RDD

 /**
- * Classification model trained using Logistic Regression.
+ * Classification model trained using Multinomial/Binary Logistic Regression.
  *
  * @param weights Weights computed for every feature.
- * @param intercept Intercept computed for this model.
+ * @param intercept Intercept computed for this model. (Only used in Binary Logistic Regression.
+ *                  In Multinomial Logistic Regression, the intercepts will not be a single values,
+ *                  so the intercepts will be part of the weights.)
+ * @param nClasses The number of possible outcomes for Multinomial Logistic Regression.
+ *                 The default value is 2 which is Binary Logistic Regression.
  */
 class LogisticRegressionModel (
     override val weights: Vector,
-    override val intercept: Double)
+    override val intercept: Double,
--- End diff --

Addressed. Thanks.
[GitHub] spark pull request: [Minor] Fix tiny typo in BlockManager
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4046#issuecomment-69978060 [Test build #25557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25557/consoleFull) for PR 4046 at commit [`a3e2a2f`](https://github.com/apache/spark/commit/a3e2a2f46d8d853d79b993ce0e22802aa243ae83). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3833#discussion_r22965406

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -61,20 +67,70 @@ class LogisticRegressionModel (
   override protected def predictPoint(
       dataMatrix: Vector,
       weightMatrix: Vector,
--- End diff --

I thought about having the weights as a matrix, but that would require changes in many places. For example, the gradient object has to change, and the underlying `GeneralizedLinearAlgorithm` has to change as well. I'm thinking of having cleaner APIs when we move the code to the `ml` package, since we can do it from scratch there.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69988559 [Test build #25563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25563/consoleFull) for PR 4048 at commit [`3a85032`](https://github.com/apache/spark/commit/3a8503227d53554155e5766ce12d48039854f163). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69988571 [Test build #25563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25563/consoleFull) for PR 4048 at commit [`3a85032`](https://github.com/apache/spark/commit/3a8503227d53554155e5766ce12d48039854f163). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4050#issuecomment-69994416 [Test build #25567 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25567/consoleFull) for PR 4050 at commit [`9962dd0`](https://github.com/apache/spark/commit/9962dd097425442d62778f72911c6320c812f153). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4585. Spark dynamic executor allocation ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4051#issuecomment-69996035 [Test build #25568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25568/consoleFull) for PR 4051 at commit [`9d7b6f9`](https://github.com/apache/spark/commit/9d7b6f98caff5de3db88d372853cccf012f36dc6). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5231][WebUI] History Server shows wrong...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4029#issuecomment-69996068 [Test build #25569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25569/consoleFull) for PR 4029 at commit [`da8bd14`](https://github.com/apache/spark/commit/da8bd1498607be57b7d1e11c2e98fe92f3221bc0). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
Github user ksakellis commented on a diff in the pull request: https://github.com/apache/spark/pull/4050#discussion_r22971793

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -219,6 +220,9 @@ class HadoopRDD[K, V](
     val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
       SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
         split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
+    } else if (split.inputSplit.value.isInstanceOf[CombineFileSplit]) {
+      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
+        split.inputSplit.value.asInstanceOf[CombineFileSplit].getPath(0), jobConf)
--- End diff --

Can you push this logic down into SparkHadoopUtil so that we don't duplicate it in two places (HadoopRDD and NewHadoopRDD)?
[GitHub] spark pull request: [SPARK-5231][WebUI] History Server shows wrong...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/4029#issuecomment-69996037 @JoshRosen Thanks for your advice. I incorporated your comments and added a test case. For now, I will take the original approach, and I will also try to address this issue using the approximation approach you mentioned for 1.2.x. What do you think?
[GitHub] spark pull request: [SPARK-5228][WebUI] Hide tables for Active Jo...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4028#discussion_r22960975

--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala ---
@@ -121,27 +125,47 @@ private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage() {
             <strong>Scheduling Mode: </strong>
             {listener.schedulingMode.map(_.toString).getOrElse("Unknown")}
           </li>
-          <li>
-            <a href="#active"><strong>Active Jobs:</strong></a>
-            {activeJobs.size}
-          </li>
-          <li>
-            <a href="#completed"><strong>Completed Jobs:</strong></a>
-            {completedJobs.size}
-          </li>
-          <li>
-            <a href="#failed"><strong>Failed Jobs:</strong></a>
-            {failedJobs.size}
-          </li>
+          {
+            if (shouldShowActiveJobs) {
+              <li>
+                <a href="#active"><strong>Active Jobs:</strong></a>
+                {activeJobs.size}
+              </li>
+            }
+          }
+          {
+            if (shouldShowCompletedJobs) {
+              <li>
+                <a href="#completed"><strong>Completed Jobs:</strong></a>
+                {completedJobs.size}
+              </li>
+            }
+          }
+          {
+            if (shouldShowFailedJobs) {
+              <li>
+                <a href="#failed"><strong>Failed Jobs:</strong></a>
+                {failedJobs.size}
+              </li>
+            }
+          }
         </ul>
       </div>

-    val content = summary ++
-      <h4 id="active">Active Jobs ({activeJobs.size})</h4> ++ activeJobsTable ++
-      <h4 id="completed">Completed Jobs ({completedJobs.size})</h4> ++ completedJobsTable ++
-      <h4 id ="failed">Failed Jobs ({failedJobs.size})</h4> ++ failedJobsTable
-
-    val helpText = "A job is triggered by a action, like count() or saveAsTextFile()."
+    var content = summary
+    if (shouldShowActiveJobs) {
+      content ++= <h4 id="active">Active Jobs ({activeJobs.size})</h4> ++
+        activeJobsTable
+    }
+    if (shouldShowCompletedJobs) {
+      content ++= <h4 id="completed">Completed Jobs ({completedJobs.size})</h4> ++
+        completedJobsTable
+    }
+    if (shouldShowFailedJobs) {
+      content ++= <h4 id ="failed">Failed Jobs ({failedJobs.size})</h4> ++
+        failedJobsTable
+    }
+    val helpText = "A job is triggered by an action, like count() or saveAsTextFile()."
+
--- End diff --

Thanks for catching and fixing this typo.
[GitHub] spark pull request: [SPARK-5228][WebUI] Hide tables for Active Jo...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4028#discussion_r22960930

--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala ---
@@ -47,7 +47,7 @@ private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage() {
     val lastStageData = lastStageInfo.flatMap { s =>
       listener.stageIdToData.get((s.stageId, s.attemptId))
     }
-    val isComplete = job.status == JobExecutionStatus.SUCCEEDED
--- End diff --

Hmm, I guess this was unused. Good catch.
[GitHub] spark pull request: [SPARK-5231][WebUI] History Server shows wrong...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4029#discussion_r22963362

--- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -58,6 +58,7 @@ case class SparkListenerTaskEnd(
 @DeveloperApi
 case class SparkListenerJobStart(
     jobId: Int,
+    time: Option[Long],
--- End diff --

I guess this is an option for backwards-compatibility reasons? We definitely know the time when posting this event to the listener bus, so I think the right approach is to have time just be a regular `Long` and pass a dummy value (`-1`) when replaying JSON that's missing that field.
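The suggestion above can be sketched concretely. This is a hypothetical illustration, not Spark's actual listener code: the event keeps a plain `Long` field, and only the JSON-replay path, where old logs may lack the field, substitutes the `-1` dummy.

```scala
// Sketch of the "plain Long + dummy on replay" approach (names are illustrative).
case class JobStartEvent(jobId: Int, time: Long)

object JobStartEvent {
  // `maybeTime` stands in for the Option produced when parsing an event log
  // written by an older version that did not record the start time.
  def fromReplay(jobId: Int, maybeTime: Option[Long]): JobStartEvent =
    JobStartEvent(jobId, maybeTime.getOrElse(-1L))
}
```

The live code path always constructs `JobStartEvent(id, System.currentTimeMillis())` directly, so the `Option` never leaks into the public event type.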
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69988816 Hey Imran, haven't looked at the code, but `YarnClusterSuite` could probably use this tag too.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-69988574 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25563/ Test FAILed.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-69992408 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25565/ Test FAILed.
[GitHub] spark pull request: [SPARK-5094][MLlib] Add Python API for Gradien...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3951#issuecomment-69995496 Taking a look now; will add comments soon!
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-69997838 [Test build #25570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25570/consoleFull) for PR 3833 at commit [`9cf9811`](https://github.com/apache/spark/commit/9cf98115c9b8ba76cd4b460e205ba87328c4e471). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5193][SQL] Tighten up SQLContext API
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4049#issuecomment-70001689 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25564/ Test PASSed.
[GitHub] spark pull request: [SPARK-2996] Implement userClassPathFirst for ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3233#discussion_r22975789

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -762,46 +764,37 @@ object Client extends Logging {
       extraClassPath: Option[String] = None): Unit = {
     extraClassPath.foreach(addClasspathEntry(_, env))
     addClasspathEntry(Environment.PWD.$(), env)
-
-    // Normally the users app.jar is last in case conflicts with spark jars
     if (sparkConf.getBoolean("spark.yarn.user.classpath.first", false)) {
-      addUserClasspath(args, sparkConf, env)
-      addFileToClasspath(sparkJar(sparkConf), SPARK_JAR, env)
-      populateHadoopClasspath(conf, env)
-    } else {
-      addFileToClasspath(sparkJar(sparkConf), SPARK_JAR, env)
-      populateHadoopClasspath(conf, env)
-      addUserClasspath(args, sparkConf, env)
+      getUserClasspath(args, sparkConf).foreach { x =>
+        addFileToClasspath(x, null, env)
+      }
     }
-
-    // Append all jar files under the working directory to the classpath.
-    addClasspathEntry(Environment.PWD.$() + Path.SEPARATOR + "*", env)
--- End diff --

It's removed for two reasons:

- It didn't serve any practical purpose.
- It could potentially lead to behavior that diverged from other cluster managers.

All jars distributed with `--jars` are added to the classpath automatically, without the need for this. The directory itself is also added, so things like `log4j.properties` uploaded by the user are in the classpath. The only change this causes is that files and archives (`--files` and `--archives`) would also end up in the app's classpath. This is the part that diverges from other cluster managers - if you use `--files` to add a jar file in standalone mode, the classes in that jar will not show up in the app's classpath. In Yarn mode they would, and I think that's wrong.
[GitHub] spark pull request: [SPARK-2996] Implement userClassPathFirst for ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3233#issuecomment-70004739 Hi @tgravescs , thanks for taking a look. Aside from all the unit tests I added, I explained the testing I did, including the code I used, in my very first comment at the top. Did you have any specific questions about that?
[GitHub] spark pull request: [SPARK-4803] [streaming] Remove duplicate Regi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3648#issuecomment-70017327 [Test build #25577 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25577/consoleFull) for PR 3648 at commit [`868efab`](https://github.com/apache/spark/commit/868efabd2c43a662b8ccfb1651192dfb95f80f06). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22981585

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -80,69 +50,157 @@ class LogisticRegression extends Estimator[LogisticRegressionModel] with Logisti
   def setRegParam(value: Double): this.type = set(regParam, value)
   def setMaxIter(value: Int): this.type = set(maxIter, value)
-  def setLabelCol(value: String): this.type = set(labelCol, value)
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, value)

   override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
+    // Check schema
     transformSchema(dataset.schema, paramMap, logging = true)
-    import dataset.sqlContext._
+
+    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
+    val oldDataset = extractLabeledPoints(dataset, paramMap)
     val map = this.paramMap ++ paramMap
-    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
-      .map { case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-      }.persist(StorageLevel.MEMORY_AND_DISK)
+    val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE
+    if (handlePersistence) {
+      oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
+    }
+
+    // Train model
     val lr = new LogisticRegressionWithLBFGS
     lr.optimizer
       .setRegParam(map(regParam))
       .setNumIterations(map(maxIter))
-    val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
-    instances.unpersist()
+    val oldModel = lr.run(oldDataset)
+    val lrm = new LogisticRegressionModel(this, map, oldModel.weights, oldModel.intercept)
+
+    if (handlePersistence) {
+      oldDataset.unpersist()
+    }
+
     // copy model params
     Params.inheritValues(map, this, lrm)
     lrm
   }

-  private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
-    validateAndTransformSchema(schema, paramMap, fitting = true)
-  }
+  override protected def featuresDataType: DataType = new VectorUDT
 }

 /**
  * :: AlphaComponent ::
+ *
  * Model produced by [[LogisticRegression]].
  */
 @AlphaComponent
 class LogisticRegressionModel private[ml] (
     override val parent: LogisticRegression,
     override val fittingParamMap: ParamMap,
-    weights: Vector)
-  extends Model[LogisticRegressionModel] with LogisticRegressionParams {
+    val weights: Vector,
+    val intercept: Double)
+  extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
+  with LogisticRegressionParams {
+
+  setThreshold(0.5)

   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, value)

-  private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
-    validateAndTransformSchema(schema, paramMap, fitting = false)
+  private val margin: Vector => Double = (features) => {
+    BLAS.dot(features, weights) + intercept
+  }
+
+  private val score: Vector => Double = (features) => {
+    val m = margin(features)
+    1.0 / (1.0 + math.exp(-m))
   }

   override def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD = {
+    // Check schema
     transformSchema(dataset.schema, paramMap, logging = true)
+
     import dataset.sqlContext._
     val map = this.paramMap ++ paramMap
-    val score: Vector => Double = (v) => {
-      val margin = BLAS.dot(v, weights)
-      1.0 / (1.0 + math.exp(-margin))
+
+    // Output selected columns only.
+    // This is a bit complicated since it tries to avoid repeated computation.
--- End diff --

Thinking more about this, I think abstracting the key links might be best. It will certainly make LogisticRegression much shorter since prediction takes up most of the file.
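The `margin`/`score` pair in the diff above is the core of logistic prediction and can be shown in isolation. This is a hedged, self-contained sketch (plain arrays instead of MLlib's `Vector` and `BLAS.dot`), not the PR's actual code: `margin` is the linear part w·x + b, and `score` squashes it through the logistic function.

```scala
// Standalone sketch of logistic-regression margin and score.
// Assumes weights.length == features.length; BLAS.dot is replaced by a loop.
object LogisticScore {
  def margin(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    var m = intercept
    var i = 0
    while (i < weights.length) {
      m += weights(i) * features(i)  // dot product term by term
      i += 1
    }
    m
  }

  // Probability of the positive class: sigmoid of the margin.
  def score(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-margin(weights, intercept, features)))
}
```

Factoring `margin` out of `score`, as the diff does, is what lets a raw-score output column and a probability output column share one computation.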
[GitHub] spark pull request: [SPARK-5095] Support capping cores and launch ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4027#issuecomment-7002 [Test build #25580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25580/consoleFull) for PR 4027 at commit [`486d2f1`](https://github.com/apache/spark/commit/486d2f11ca278ed497d712a6adcbc41fa3a9400c). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-70020007 [Test build #25581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25581/consoleFull) for PR 3861 at commit [`99415c3`](https://github.com/apache/spark/commit/99415c3bc9973f2f80faaf7f5742b3bc860bc900). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-70021585 @loachli @bgreeven We are thinking of renaming the `ArtificialNeuralNetwork` and `ANNClassifier` objects to `ANNWithLBFGS` and `ANNClassifierWithLBFGS` to be in line with the naming convention in MLlib. Are there any objections?
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
Github user ksakellis commented on a diff in the pull request: https://github.com/apache/spark/pull/4050#discussion_r22985033
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -219,6 +220,9 @@ class HadoopRDD[K, V](
     val bytesReadCallback = if (split.inputSplit.value.isInstanceOf[FileSplit]) {
       SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
         split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
+    } else if (split.inputSplit.value.isInstanceOf[CombineFileSplit]) {
+      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
+        split.inputSplit.value.asInstanceOf[CombineFileSplit].getPath(0), jobConf)
--- End diff --
Yes, `SparkHadoopUtil` can check for those classes. It can have a matcher on the four classes (two new and two old), so the call from `HadoopRDD` would be something like `SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(split.inputSplit, jobConf)`. Not a big deal, I guess, since `SparkHadoopUtil` will still have four cases, but at least that logic is centralized.
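The centralized helper suggested here could look roughly like the following Scala sketch. The method name mirrors the discussion, but the exact signature, the `bytesReadCallbackForPath` helper, and the callback plumbing are assumptions for illustration, not the actual `SparkHadoopUtil` API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileSplit => OldFileSplit}
import org.apache.hadoop.mapred.lib.{CombineFileSplit => OldCombineFileSplit}
import org.apache.hadoop.mapreduce.lib.input.{CombineFileSplit => NewCombineFileSplit, FileSplit => NewFileSplit}

// Hypothetical: match once on the four split types (two old-API, two
// new-API) so HadoopRDD/NewHadoopRDD can pass the split straight through
// instead of repeating isInstanceOf/asInstanceOf checks.
def getFSBytesReadOnThreadCallback(split: AnyRef, conf: Configuration): Option[() => Long] = {
  val path: Option[Path] = split match {
    case s: OldFileSplit        => Some(s.getPath)
    case s: OldCombineFileSplit => Some(s.getPath(0))
    case s: NewFileSplit        => Some(s.getPath)
    case s: NewCombineFileSplit => Some(s.getPath(0))
    case _                      => None  // unknown split type: no bytes-read metrics
  }
  path.map(p => bytesReadCallbackForPath(p, conf))  // existing per-path lookup, assumed
}
```

With that in place, the call site in `HadoopRDD` collapses to the one-liner suggested above, and the four cases live in a single place.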
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r22975410
--- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractLauncher.java ---
@@ -0,0 +1,451 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileFilter;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.jar.JarFile;
+import java.util.regex.Pattern;
+
+/**
+ * Basic functionality for launchers.
--- End diff --
This could use a little explanation. What is a launcher? When should someone consider extending this class?
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r22976011
--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherCommon.java ---
@@ -0,0 +1,250 @@
(Apache license header, as above)
+package org.apache.spark.launcher;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Configuration key definitions for Spark jobs, and some helper methods.
+ */
+public class LauncherCommon {
+
+  /** The Spark master. */
+  public static final String SPARK_MASTER = "spark.master";
+
+  /** Configuration key for the driver memory. */
+  public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver class path. */
+  public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
+  /** Configuration key for the driver VM options. */
+  public static final String DRIVER_EXTRA_JAVA_OPTIONS = "spark.driver.extraJavaOptions";
+  /** Configuration key for the driver native library path. */
+  public static final String DRIVER_EXTRA_LIBRARY_PATH = "spark.driver.extraLibraryPath";
+
+  /** Configuration key for the executor memory. */
+  public static final String EXECUTOR_MEMORY = "spark.executor.memory";
+  /** Configuration key for the executor class path. */
+  public static final String EXECUTOR_EXTRA_CLASSPATH = "spark.executor.extraClassPath";
+  /** Configuration key for the executor VM options. */
+  public static final String EXECUTOR_EXTRA_JAVA_OPTIONS = "spark.executor.extraJavaOptions";
+  /** Configuration key for the executor native library path. */
+  public static final String EXECUTOR_EXTRA_LIBRARY_PATH = "spark.executor.extraLibraryOptions";
+  /** Configuration key for the number of executor CPU cores. */
+  public static final String EXECUTOR_CORES = "spark.executor.cores";
+
+  /** Returns whether the given string is null or empty. */
+  protected static boolean isEmpty(String s) {
+    return s == null || s.isEmpty();
+  }
+
+  /** Joins a list of strings using the given separator. */
+  protected static String join(String sep, String... elements) {
+    StringBuilder sb = new StringBuilder();
+    for (String e : elements) {
+      if (e != null) {
+        if (sb.length() > 0) {
+          sb.append(sep);
+        }
+        sb.append(e);
+      }
+    }
+    return sb.toString();
+  }
+
+  /** Joins a list of strings using the given separator. */
--- End diff --
Can this be replaced with Guava's `Joiner.on`? Or are we somehow avoiding Guava's inclusion?
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r22976317
--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherCommon.java ---
(same hunk as quoted above)
+  /** Joins a list of strings using the given separator. */
--- End diff --
This library should not have any external dependencies.
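For what it's worth, the no-external-dependencies constraint is cheap to satisfy here. The launcher itself is Java, but the contract of its hand-rolled helper — join non-null elements with a separator — is a one-liner; this Scala sketch only illustrates that contract, it is not the launcher code:

```scala
// Same behavior as the launcher's join(): drop null elements, separate
// the rest with `sep` — no Guava Joiner required.
def join(sep: String, elements: String*): String =
  elements.filter(_ != null).mkString(sep)

join(" ", "spark-submit", null, "--master", "yarn")  // "spark-submit --master yarn"
```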
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-70007063 Jenkins, please re-test again.
[GitHub] spark pull request: [Minor] Fix tiny typo in BlockManager
Github user squito commented on the pull request: https://github.com/apache/spark/pull/4046#issuecomment-70013728 lgtm
[GitHub] spark pull request: SPARK-5217 Spark UI should report pending stag...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/4043#issuecomment-70015231 lgtm. I was going to suggest that pending stages should be sorted with the oldest submission time first, not reversed ... but I guess we want the completed stages sorted with the oldest last, and it probably makes sense to keep those tables consistent with each other.
[GitHub] spark pull request: [SPARK-4707][STREAMING] Reliable Kafka Receive...
Github user harishreedharan commented on the pull request: https://github.com/apache/spark/pull/3655#issuecomment-70015327 No, this does prevent data loss: if the store fails multiple times, we shut down the receiver completely. The new receiver that then gets started resumes from the last commit, so we are safe.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22983383
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala ---
@@ -0,0 +1,195 @@
(Apache license header)
+package org.apache.spark.examples.ml
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.SparkContext._
+import org.apache.spark.ml.classification.{Classifier, ClassifierParams, ClassificationModel}
+import org.apache.spark.ml.param.{Params, IntParam, ParamMap}
+import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors, VectorUDT}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.sql.{DataType, SchemaRDD, Row, SQLContext}
+
+/**
+ * A simple example demonstrating how to write your own learning algorithm using Estimator,
+ * Transformer, and other abstractions.
+ * This mimics [[org.apache.spark.ml.classification.LogisticRegression]].
+ * Run with
+ * {{{
+ * bin/run-example ml.DeveloperApiExample
+ * }}}
+ */
+object DeveloperApiExample {
+
+  def main(args: Array[String]) {
+    val conf = new SparkConf().setAppName("DeveloperApiExample")
+    val sc = new SparkContext(conf)
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Prepare training data.
+    val training = sc.parallelize(Seq(
+      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
+      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
+      LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
+      LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
+
+    // Create a LogisticRegression instance. This instance is an Estimator.
+    val lr = new MyLogisticRegression()
+    // Print out the parameters, documentation, and any default values.
+    println("MyLogisticRegression parameters:\n" + lr.explainParams() + "\n")
+
+    // We may set parameters using setter methods.
+    lr.setMaxIter(10)
+
+    // Learn a LogisticRegression model. This uses the parameters stored in lr.
+    val model = lr.fit(training)
+
+    // Prepare test data.
+    val test = sc.parallelize(Seq(
+      LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
+      LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
+      LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))
+
+    // Make predictions on test data.
+    val sumPredictions: Double = model.transform(test)
+      .select('features, 'label, 'prediction)
+      .collect()
+      .map { case Row(features: Vector, label: Double, prediction: Double) =>
+        prediction
+      }.sum
+    assert(sumPredictions == 0.0,
+      "MyLogisticRegression predicted something other than 0, even though all weights are 0!")
+  }
+}
+
+/**
+ * Example of defining a parameter trait for a user-defined type of [[Classifier]].
+ *
+ * NOTE: This is private since it is an example. In practice, you may not want it to be private.
+ */
+private trait MyLogisticRegressionParams extends ClassifierParams {
+
+  /** param for max number of iterations */
+  val maxIter: IntParam = new IntParam(this, "maxIter", "max number of iterations")
+  def getMaxIter: Int = get(maxIter)
+}
+
+/**
+ * Example of defining a type of [[Classifier]].
+ *
+ * NOTE: This is private since it is an example. In practice, you may not want it to be private.
+ */
+private class MyLogisticRegression
+  extends Classifier[Vector, MyLogisticRegression, MyLogisticRegressionModel]
+  with MyLogisticRegressionParams {
+
+  setMaxIter(100) // Initialize
+
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  override def fit(dataset: SchemaRDD, paramMap: ParamMap): MyLogisticRegressionModel = {
+    // Check schema (types). This allows early failure before running the algorithm.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70024523 By the way, I'm running larger-scale tests, and I'll post results once they are ready!
[GitHub] spark pull request: [SPARK-5094][MLlib] Add Python API for Gradien...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3951#discussion_r22972683
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -21,6 +21,8 @@
 import java.io.OutputStream
 import java.nio.{ByteBuffer, ByteOrder}
 import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}
+import org.apache.spark.mllib.tree.loss.Losses
--- End diff --
Organize imports, ordered as: scala/java, outside libraries, spark (alphabetized within groups)
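The convention referred to here groups imports into three blocks separated by blank lines. A sketch of what a conforming header might look like — the specific imports are illustrative, not the file's actual set:

```scala
// Group 1: java, then scala
import java.io.OutputStream
import java.nio.{ByteBuffer, ByteOrder}
import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}

import scala.collection.JavaConverters._

// Group 2: outside libraries
import net.razorvine.pickle.Pickler

// Group 3: spark, alphabetized within the group
import org.apache.spark.mllib.tree.loss.Losses
```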
[GitHub] spark pull request: [SPARK-5094][MLlib] Add Python API for Gradien...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3951#issuecomment-69998021 @kazk1018 Thanks for the PR! A few high-level items: * Will it reduce duplicate code to abstract the TreeEnsembleModel concept, as in Scala? Forests and boosting produce models which are very similar. GradientBoostedTreesModel and RandomForestModel could wrap the abstract class. * Default parameter values: You state default parameter values in the docs for trainClassifier/Regressor, but they are not actually set in the method declarations. Could you please fix that?
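The abstraction suggested in the first bullet — one shared ensemble model that both concrete models wrap — can be sketched as follows. The class names and the weighted-sum combination rule are placeholders for illustration, not the actual MLlib classes:

```scala
// A tree is anything that can score a feature vector.
trait DecisionTreeLike {
  def predict(features: Array[Double]): Double
}

// Shared ensemble logic: combine member trees with per-tree weights.
abstract class EnsembleModelSketch(
    val trees: Seq[DecisionTreeLike],
    val weights: Seq[Double]) {
  require(trees.size == weights.size, "one weight per tree")
  def predict(features: Array[Double]): Double =
    trees.zip(weights).map { case (t, w) => w * t.predict(features) }.sum
}

// A forest averages equally weighted trees...
class RandomForestSketch(trees: Seq[DecisionTreeLike])
  extends EnsembleModelSketch(trees, Seq.fill(trees.size)(1.0 / trees.size))

// ...while boosting keeps the weights learned during training.
class GradientBoostedSketch(trees: Seq[DecisionTreeLike], weights: Seq[Double])
  extends EnsembleModelSketch(trees, weights)
```

The Python wrappers would then only need to store trees and weights once, with the two `train*` entry points choosing the weighting scheme.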
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r22975493
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala ---
@@ -19,11 +19,14 @@
 package org.apache.spark.deploy.worker
 import java.io.{File, FileOutputStream, InputStream, IOException}
 import java.lang.System._
+import java.util.{ArrayList, List => JList, Map => JMap}
--- End diff --
ArrayList seems to be unused
[GitHub] spark pull request: [SPARK-2996] Implement userClassPathFirst for ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3233#discussion_r22975503
--- Diff: yarn/src/test/resources/log4j.properties ---
@@ -16,7 +16,7 @@
 #
 # Set everything to be logged to the file target/unit-tests.log
-log4j.rootCategory=INFO, file
+log4j.rootCategory=DEBUG, file
--- End diff --
Yes. It never made much sense to me to have test logs restricted to `INFO`.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-70006013 [Test build #25570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25570/consoleFull) for PR 3833 at commit [`9cf9811`](https://github.com/apache/spark/commit/9cf98115c9b8ba76cd4b460e205ba87328c4e471). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.