[GitHub] spark pull request: [SPARK-1946] Submit tasks after (configured ra...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/900#discussion_r14280473

--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+import org.apache.spark.SparkContext
+import org.apache.spark.scheduler.TaskSchedulerImpl
+import org.apache.spark.util.IntParam
+
+private[spark] class YarnClusterSchedulerBackend(
+    scheduler: TaskSchedulerImpl,
+    sc: SparkContext)
+  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem) {
+
+  override def start() {
+    super.start()
+    var numExecutors = 2
+    if (sc.getConf.contains("spark.executor.instances")) {
+      numExecutors = sc.getConf.getInt("spark.executor.instances", 2)
--- End diff --

(so you don't override it if spark.executor.instances is not already set)

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
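The pattern kayousterhout points at above, dropping the `contains` guard and letting `getInt`'s default argument cover the unset case, can be sketched roughly as follows. This is a hypothetical, self-contained miniature: the `Conf` class below only mimics SparkConf's `contains`/`getInt` behavior and is not the actual Spark code.

```scala
// Hypothetical stand-in for SparkConf: only contains() and getInt() are mimicked.
class Conf(settings: Map[String, String]) {
  def contains(key: String): Boolean = settings.contains(key)
  def getInt(key: String, default: Int): Int =
    settings.get(key).map(_.toInt).getOrElse(default)
}

object NumExecutorsExample {
  // Shape of the code under review: a mutable var plus an explicit contains() check.
  def numExecutorsBefore(conf: Conf): Int = {
    var numExecutors = 2
    if (conf.contains("spark.executor.instances")) {
      numExecutors = conf.getInt("spark.executor.instances", 2)
    }
    numExecutors
  }

  // The reviewer's suggestion: getInt's default already handles the unset case,
  // so the guard and the mutable var are unnecessary.
  def numExecutorsAfter(conf: Conf): Int =
    conf.getInt("spark.executor.instances", 2)
}
```

Both versions return the same value for any configuration, which is the point of the review comment.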
[GitHub] spark pull request: [SPARK-2297][UI] Make task attempt and specula...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/1236#issuecomment-47310502 Oh, good point, that makes sense. On Thu, Jun 26, 2014 at 10:21 PM, Reynold Xin notificati...@github.com wrote: It's going to be useless if we stop using them in logs. I think right now they might still be useful since they can be used to correlate with log messages. (https://github.com/apache/spark/pull/1236#issuecomment-47308828)
[GitHub] spark pull request: [SPARK-1946] Submit tasks after (configured ra...
Github user li-zhihui commented on a diff in the pull request: https://github.com/apache/spark/pull/900#discussion_r14280510 --- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala --- + numExecutors = sc.getConf.getInt("spark.executor.instances", 2) --- End diff -- Cool!
[GitHub] spark pull request: [SPARK-2304] tera sort example program for shu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1242#issuecomment-47310836 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [SPARK-2304] tera sort example program for shu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1242#issuecomment-47310838 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16196/
[GitHub] spark pull request: SPARK-2186: Spark SQL DSL support for simple a...
Github user edrevo commented on a diff in the pull request: https://github.com/apache/spark/pull/1211#discussion_r14280656

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
@@ -108,6 +108,24 @@ package object dsl {
   implicit def symbolToUnresolvedAttribute(s: Symbol) = analysis.UnresolvedAttribute(s.name)
+  def sum(e: Expression) = Sum(e)
+  def sum(d: DistinctExpression) = SumDistinct(d.expression)
--- End diff --

There's no implicitness going on here, since the user needs to explicitly call both `sum` and `distinct`. I have no problem changing it, though. Fixed.
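A miniature of the DSL shape under discussion may make the overload pairing clearer. This is a hypothetical sketch, not the Catalyst code: a plain `sum` call wraps an expression in `Sum`, while calling `.distinct` first produces a `DistinctExpression` that routes to the `SumDistinct` overload. All names here mirror the diff but are simplified stand-ins.

```scala
// Hypothetical miniature expression tree, simplified from the diff context.
sealed trait Expression
case class Attr(name: String) extends Expression
case class Sum(child: Expression) extends Expression
case class SumDistinct(child: Expression) extends Expression
case class DistinctExpression(expression: Expression)

object MiniDsl {
  // .distinct wraps an expression, selecting the second sum() overload below.
  implicit class RichExpr(e: Expression) {
    def distinct: DistinctExpression = DistinctExpression(e)
  }
  def sum(e: Expression): Expression = Sum(e)
  def sum(d: DistinctExpression): Expression = SumDistinct(d.expression)
}
```

As edrevo notes, nothing implicit decides between the two: overload resolution picks the `DistinctExpression` variant only when the user explicitly wrote `.distinct`.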
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47311175 Merged build triggered.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47311182 Merged build started.
[GitHub] spark pull request: [SPARK-2288] Hide ShuffleBlockManager behind S...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1241#issuecomment-47311880 Merged build triggered.
[GitHub] spark pull request: [SPARK-2288] Hide ShuffleBlockManager behind S...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1241#issuecomment-47311892 Merged build started.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47312756 Hey @andrewor14, I still need to fix some issues with reading the Hadoop file size pointed out by @pwendell and also update the UI to show the DataReadMethod; I will finish this tomorrow (just wanted to let you know so you don't waste time looking at this before I'm done).
[GitHub] spark pull request: [SPARK-2259] Fix highly misleading docs on clu...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1200#issuecomment-47313462 Jenkins, test this please
[GitHub] spark pull request: support for Kinesis
Github user venuktan commented on the pull request: https://github.com/apache/spark/pull/223#issuecomment-47313539 Hi Parviz, is there a package in the Maven repo called spark-amazonkinesis-asl now? If not, how do I use this package?
[GitHub] spark pull request: [SPARK-2259] Fix highly misleading docs on clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1200#issuecomment-47313574 Merged build triggered.
[GitHub] spark pull request: [SPARK-2259] Fix highly misleading docs on clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1200#issuecomment-47313581 Merged build started.
[GitHub] spark pull request: Resolve sbt warnings during build
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1153#issuecomment-47314418 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16197/
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47314415 Merged build finished.
[GitHub] spark pull request: [SPARK-2288] Hide ShuffleBlockManager behind S...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1241#issuecomment-47314420 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16199/
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47314419 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16198/
[GitHub] spark pull request: [SPARK-2288] Hide ShuffleBlockManager behind S...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1241#issuecomment-47314417 Merged build finished.
[GitHub] spark pull request: Resolve sbt warnings during build
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1153#issuecomment-47314416 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
GitHub user BaiGang opened a pull request: https://github.com/apache/spark/pull/1243 [MLLIB] SPARK-2303: Poisson regression model for count data

This pull request includes the implementations of Poisson regression in mllib.regression for modeling count data. In detail, it includes:
1. The gradient of the negative log-likelihood of the Poisson regression model.
2. The implementations of PoissonRegressionModel, including the generalized linear algorithm classes, which use L-BFGS and SGD respectively for parameter estimation, and the companion objects.
3. The test suites:
  * the gradient/loss computation
  * the regression method using L-BFGS optimization on a generated data set
  * the regression method using L-BFGS optimization on a real-world data set
  * the regression method using SGD optimization on a generated data set
  * the regression method using SGD optimization on a real-world data set
4. A Poisson regression data generator in mllib/util for producing the test data.

JIRA: https://issues.apache.org/jira/browse/SPARK-2303

You can merge this pull request into a Git repository by running:
  $ git pull https://github.com/BaiGang/spark poisson
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1243.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1243

commit abf543d3f36a02e5dbbad797ff8f84c043855469
Author: Gang Bai m...@baigang.net
Date: 2014-06-27T05:35:24Z
The implementations of Poisson regression in mllib/regression. It includes 1) the gradient of the negative log-likelihood, 2) the implementation of PoissonRegressionModel, the generalized linear algorithm class which uses L-BFGS and SGD for parameter estimation respectively, 3) the test suites for the gradient/loss computation, the regression method on generated and real-world data sets, and 4) a Poisson regression data generator in mllib/util for producing the test data.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47315401 Merged build triggered.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47315564 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16201/
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47315563 Merged build finished.
[GitHub] spark pull request: [SPARK-2288] Hide ShuffleBlockManager behind S...
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1241#discussion_r14282384

--- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockManager.scala ---
@@ -0,0 +1,36 @@
+package org.apache.spark.shuffle
+
+import org.apache.spark.storage.{FileSegment, ShuffleBlockId}
+import java.nio.ByteBuffer
+
+private[spark]
+trait ShuffleBlockManager {
--- End diff --

@rxin, how about we also hide the current BlockFetcherIterator kind of thing behind the ShuffleManager, since a specific ShuffleManager does not necessarily use the current fetcher approach to get shuffle data? Each ShuffleManager should instantiate its own shuffle logic, though some could reuse the same logic; say, a file-based one could reuse the current implementation. This way, we can solve the above problem and have a better chance of not exposing ShuffleBlockManager at all; say, a read/write interface for the shuffle reader/writer is enough.
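The direction colorant sketches, exposing only read/write interfaces and letting each shuffle manager instantiate its own fetch logic behind them, might look roughly like this. Every name here (`ShuffleWriter`, `ShuffleReader`, `LocalShuffleManager`) is a hypothetical illustration, not Spark's actual interfaces.

```scala
// Hypothetical read/write facade: the scheduler would only see these traits,
// never the fetcher or block-manager machinery behind them.
trait ShuffleWriter { def write(partition: Int, bytes: Array[Byte]): Unit }
trait ShuffleReader { def read(partition: Int): Iterator[Array[Byte]] }
trait ShuffleManager {
  def getWriter(shuffleId: Int): ShuffleWriter
  def getReader(shuffleId: Int): ShuffleReader
}

// Toy in-memory implementation, just to show that the fetch logic is
// entirely the manager's own business.
class LocalShuffleManager extends ShuffleManager {
  private val data =
    scala.collection.mutable.Map[(Int, Int), List[Array[Byte]]]().withDefaultValue(Nil)

  def getWriter(shuffleId: Int): ShuffleWriter = new ShuffleWriter {
    def write(partition: Int, bytes: Array[Byte]): Unit =
      data((shuffleId, partition)) = bytes :: data((shuffleId, partition))
  }

  def getReader(shuffleId: Int): ShuffleReader = new ShuffleReader {
    def read(partition: Int): Iterator[Array[Byte]] =
      data((shuffleId, partition)).reverseIterator
  }
}
```

A file-based manager could implement the same two traits over the existing fetcher, which is the reuse colorant mentions.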
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user BaiGang commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47316481 Fixed scalastyle.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47316522 Merged build triggered.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47316533 Merged build started.
[GitHub] spark pull request: SPARK-2126: Move MapOutputTracker behind Shuff...
Github user colorant commented on the pull request: https://github.com/apache/spark/pull/1240#issuecomment-47316658 I guess the idea of putting the MapOutputTracker behind the ShuffleManager is not just to make it a ShuffleManager member while still calling that member's functions from the DAGScheduler side? The external interface should probably be reduced to a minimum if it is not possible to hide it completely; most of the logic should be handled within the ShuffleManager itself. Of course, this could not be done without changes to the shuffle fetcher etc. Just my thought, might not be correct ;)
[GitHub] spark pull request: SPARK-2159: Add support for stopping SparkCont...
Github user adamosloizou commented on a diff in the pull request: https://github.com/apache/spark/pull/1230#discussion_r14282766 --- Diff: repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -597,7 +597,13 @@ class SparkILoop(in0: Option[BufferedReader], protected val out: JPrintWriter, if (!awaitInitialized()) return false runThunks() } - if (line eq null) false // assume null means EOF + /* Stop loop if: --- End diff -- Thanks for the nit. Fixed.
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1244 [SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1244.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1244 commit 05c3a789a00996a5502b78711b44d80e8812fdbb Author: hakeemzhai hakeemzhai@hakeemzhai.(none) Date: 2014-06-27T07:42:18Z [SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1244#issuecomment-47317532 Can one of the admins verify this patch?
[GitHub] spark pull request: feature/glm
Github user BaiGang commented on the pull request: https://github.com/apache/spark/pull/1237#issuecomment-47317665 Oops! I didn't notice this one. Created https://github.com/apache/spark/pull/1243 just now. We actually implemented exactly the same idea of Poisson regression, with only some tiny differences in calculating the gradient of the negative log-likelihood and in the test suites.
[GitHub] spark pull request: feature/glm
Github user BaiGang commented on a diff in the pull request: https://github.com/apache/spark/pull/1237#discussion_r14283042

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala ---
@@ -175,3 +175,80 @@ class HingeGradient extends Gradient { } } }
+
+/**
+ * :: DeveloperApi ::
+ * Compute gradient and loss for MLE of Poisson Regression with log link function.
+ * The gradient is calculated as follows:
+ *   f' = x_i * (exp(x_i * w) - y_i)
+ */
+@DeveloperApi
+class PoissonGradient extends Gradient {
+  def fact(n: Int): Int =
+    (1 to n).foldLeft(1) { _ * _ }
+
+  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
+    val brzData = data.toBreeze
+    val brzWeights = weights.toBreeze
+    val dotProd = brzWeights.dot(brzData)
+    val diff = math.exp(dotProd) - label
+    val loss = -dotProd * label + math.exp(dotProd) + fact(label.toInt)
--- End diff --

We can safely remove the fact(.) part, because it has virtually nothing to do with the resulting weights.
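For reference, the quantities in the quoted diff can be sketched on plain arrays. Assuming the usual Poisson model with log link, the per-example negative log-likelihood is exp(x.w) - y*(x.w) + log(y!), and since the log(y!) term is constant in w it drops out of the gradient x*(exp(x.w) - y), which is why the factorial part can be removed without affecting the fitted weights. The sketch below is hypothetical and self-contained, not the MLlib `Gradient` API:

```scala
// Hypothetical sketch of the Poisson NLL pieces from the diff, on plain arrays.
object PoissonSketch {
  def dot(x: Array[Double], w: Array[Double]): Double =
    x.zip(w).map { case (a, b) => a * b }.sum

  // Per-example negative log-likelihood, up to the constant log(y!) term.
  def loss(x: Array[Double], y: Double, w: Array[Double]): Double = {
    val z = dot(x, w)
    math.exp(z) - y * z
  }

  // Gradient of the loss above w.r.t. w: x * (exp(x.w) - y).
  def gradient(x: Array[Double], y: Double, w: Array[Double]): Array[Double] = {
    val diff = math.exp(dot(x, w)) - y
    x.map(_ * diff)
  }
}
```

Because the dropped term does not depend on w, the gradient, and hence the optimizer's trajectory, is identical with or without it; only the reported loss value shifts by a constant.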
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1244#issuecomment-47318509 If we are going to remove this feature, we should just take the sparkHome field out of `ApplicationDescription` entirely.
[GitHub] spark pull request: SPARK-2159: Add support for stopping SparkCont...
Github user adamosloizou commented on the pull request: https://github.com/apache/spark/pull/1230#issuecomment-47319385 @vanzin great catch! Unfortunately, it will not work with this patch as it captures the `exit` before it passes it down to the evaluation section:
```
scala> val exit = 1
exit: Int = 1

scala> exit
Stopping spark context.
```
From a quick look, it seems to be non-trivial to intercept the `exit` evaluation at a lower level. The patch seems to only subvert single-line evals of `exit`:
```
scala> :paste
// Entering paste mode (ctrl-D to finish)

val exit = 1
exit

// Exiting paste mode, now interpreting.

exit: Int = 1
res0: Int = 1
```
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47319699 Merged build started.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47319669 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [MLLIB] SPARK-2303: Poisson regression model f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1243#issuecomment-47319671 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16202/
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47319693 Merged build triggered.
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47322622 Build triggered.
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47322632 Build started.
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47322779 Build finished.
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47322780 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16204/
[GitHub] spark pull request: [SPARK-2104] Fix task serializing issues when ...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/1245 [SPARK-2104] Fix task serializing issues when sort with Java non serializable class Details can be seen in [SPARK-2104](https://issues.apache.org/jira/browse/SPARK-2104). This work is based on Reynold's work and adds some unit tests to validate the issue. @rxin, would you please take a look at this PR? Thanks a lot. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jerryshao/apache-spark SPARK-2104 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1245.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1245 commit 47d763cc817dc1fe05e7caf1bf8357a5c427a256 Author: jerryshao saisai.s...@intel.com Date: 2014-06-27T08:23:21Z Fix task serializing issue when sort with Java non serializable class commit 2b41917714dc2c33c5cf0d544945a8a651360c2b Author: jerryshao saisai.s...@intel.com Date: 2014-06-27T09:14:26Z Minor changes
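For context (this is not the PR's code), the underlying failure mode is plain Java serialization: a key class that does not implement `Serializable` throws `NotSerializableException` the moment anything, such as a shuffle during a sort, tries to serialize it. A minimal sketch with hypothetical class names:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical key class: usable locally, but not Java-serializable,
// so shipping it across the wire fails.
class PlainKey(val value: Int)

// The same shape of class marked Serializable serializes fine.
class SerializableKey(val value: Int) extends Serializable

// Returns true if the object survives Java serialization.
def trySerialize(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

val plainOk = trySerialize(new PlainKey(1))
val markedOk = trySerialize(new SerializableKey(1))
```

The same check is roughly what a unit test for this class of bug needs to exercise.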
[GitHub] spark pull request: [SPARK-2104] Fix task serializing issues when ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1245#issuecomment-47324256 Merged build started.
[GitHub] spark pull request: [SPARK-2104] Fix task serializing issues when ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1245#issuecomment-47324241 Merged build triggered.
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/1151#discussion_r14285252
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala ---
@@ -136,13 +137,12 @@ class SqlParser extends StandardTokenParsers with PackratParsers {
     }
   }

-  protected lazy val query: Parser[LogicalPlan] = (
-    select * (
-      UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) } |
-      UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
-    )
-    | insert | cache
-  )
+  protected lazy val query: Parser[LogicalPlan] =
+    select * (
+      UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2)} |
+      UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2))} |
--- End diff --
Hi @YanjieGao, Jenkins says a scalastyle error exists here, which is "File line length exceeds 100 characters". You have to format the code around this line.
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47327832 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16203/
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47327831 Merged build finished.
[GitHub] spark pull request: [SPARK-2104] Fix task serializing issues when ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1245#issuecomment-47327948 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [SPARK-2104] Fix task serializing issues when ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1245#issuecomment-47327950 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16205/
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user ueshin commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47328485 Passed Hive tests. Why? Just merged master. And Python tests failed...
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47329052 Build started.
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47329173 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16206/
[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1151#issuecomment-47329172 Build finished.
[GitHub] spark pull request: [WIP] SPARK-2126: Move MapOutputTracker behind...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/1240#issuecomment-47331112 @colorant yes, the current version is definitely not a perfect one; I'm also aware that function calls like ``` shuffleManager.mapOutputTracker.xxx ``` are not clean. The reason I hesitate to make further refactoring is that I'm not sure we really want ShuffleManager to know anything about the stuff in other domains (e.g. Executor, which is supposed to be a scheduling concern and would possibly be introduced to ShuffleManager if we want to do everything with it instead of in DAGScheduler). In that case, I'm afraid that in the future we will fall into the same situation that we are facing in DAGScheduler now (DAGScheduler knows everything, from the task level to the DAG level). Any suggestions? Also @pwendell @markhamstra
[GitHub] spark pull request: [SPARK-1946] Submit tasks after (configured ra...
Github user li-zhihui commented on the pull request: https://github.com/apache/spark/pull/900#issuecomment-47331430 @tgravescs @kayousterhout I moved waitBackendReady back to the submitTasks method, because calling waitBackendReady in the start method does not work in yarn-cluster mode (a NullPointerException because SparkContext initialization times out); yarn-client mode is ok.
[GitHub] spark pull request: Fix for SPARK-2228
Github user kellrott commented on the pull request: https://github.com/apache/spark/pull/1182#issuecomment-47335004 My problem code started by calling SVMWithSGD.train in several parallel threads. This matches your notes about events being generated too fast for the listener.
[GitHub] spark pull request: SPARK-1890 and SPARK-1891- add admin and modif...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1196#issuecomment-47341211 You can't determine the user unless some sort of authentication filter is in place; the UI returns null in that case. You can't check acls against a null user, so all you can do is assume it's either on or off. Since an authentication filter could choose to not filter all web UI pages, some may come back with a user and some may not. That is why we assume that if there is no user, everyone has access. The only way I see around that would be to build in some sort of config with a real list. We could also change this behavior for, say, CLI interfaces, if we want them to do something different than the web UI interfaces.
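The access rule described above can be sketched in a few lines of plain Scala (the method name is hypothetical, not the actual SecurityManager API): when no authenticated user is available the request is allowed; otherwise the user must appear in the view acls.

```scala
// Hypothetical helper mirroring the behavior described in the comment:
// a null user (no authentication filter installed) is always allowed,
// otherwise the user must be in the acl set.
def checkUIViewPermissions(user: String, viewAcls: Set[String], aclsEnabled: Boolean): Boolean =
  !aclsEnabled || user == null || viewAcls.contains(user)

val acls = Set("alice", "bob")
val anonymousAllowed = checkUIViewPermissions(null, acls, aclsEnabled = true)
val aliceAllowed = checkUIViewPermissions("alice", acls, aclsEnabled = true)
val malloryAllowed = checkUIViewPermissions("mallory", acls, aclsEnabled = true)
```

The null-user branch is exactly the "if there is no user, everyone has access" assumption under discussion.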
[GitHub] spark pull request: SPARK-1890 and SPARK-1891- add admin and modif...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/1196#discussion_r14291718
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -169,18 +192,43 @@ private[spark] class SecurityManager(sparkConf: SparkConf) extends Logging {
     )
   }

-  private[spark] def setViewAcls(defaultUsers: Seq[String], allowedUsers: String) {
-    viewAcls = (defaultUsers ++ allowedUsers.split(',')).map(_.trim()).filter(!_.isEmpty).toSet
+  /**
+   * Split a comma separated String, filter out any empty items, and return a Set of strings
+   */
+  private def stringToSet(list: String): Set[String] = {
+    (list.split(',')).map(_.trim()).filter(!_.isEmpty).toSet
+  }
+
+  private[spark] def setViewAcls(defaultUsers: Set[String], allowedUsers: String) {
+    viewAcls = (adminAcls ++ defaultUsers ++ stringToSet(allowedUsers))
     logInfo("Changing view acls to: " + viewAcls.mkString(","))
   }

   private[spark] def setViewAcls(defaultUser: String, allowedUsers: String) {
-    setViewAcls(Seq[String](defaultUser), allowedUsers)
+    setViewAcls(Set[String](defaultUser), allowedUsers)
+  }
+
+  private[spark] def getViewAcls: String = viewAcls.mkString(",")
+
+  private[spark] def setModifyAcls(defaultUsers: Set[String], allowedUsers: String) {
+    modifyAcls = (adminAcls ++ defaultUsers ++ stringToSet(allowedUsers))
--- End diff --
Yes, it requires it to be set beforehand. I went back and forth on this a bit and chose to keep it this way since it's private and only really called in one place at this point (the history UI). And actually only the view one is called; the modify one isn't called anywhere outside of this class. We could add the additional logic, but I kind of see it as just overhead right now. Normally everything is initialized when you create the SecurityManager, so these routines aren't called outside of here. I could be swayed to change it. I should at least add a comment here; I have it in some other places, but should add it here too.
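The `stringToSet` helper in the diff is plain Scala and easy to check in isolation; this sketch reproduces its logic outside the class:

```scala
// Same logic as the stringToSet helper in the diff: split a comma-separated
// acl string, trim whitespace, and drop empty entries.
def stringToSet(list: String): Set[String] =
  list.split(',').map(_.trim()).filter(!_.isEmpty).toSet

// Messy input with extra spaces and empty items collapses cleanly.
val acls = stringToSet("alice, bob,,  carol ,")
```

Because the result is a Set, duplicate entries in the config string are also de-duplicated for free.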
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47370415 Merged build triggered.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47370437 Merged build started.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47370627 Merged build finished.
[GitHub] spark pull request: SPARK-2159: Add support for stopping SparkCont...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/1230#issuecomment-47372033 I think it's unlikely that people are redefining exit() in the shell or using exit as a variable name; but just for completeness, you can leave the shell by typing `:quit`. (btw, if `exit()` is an alias to `System.exit()` or something, maybe registering a shutdown hook would suffice?)
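The shutdown-hook idea above can be sketched in plain Scala (assuming the goal is simply to stop the SparkContext on JVM exit; `stopContext` is a hypothetical stand-in for `sc.stop()`):

```scala
// Hypothetical stand-in for sc.stop().
def stopContext(): Unit = println("Stopping spark context.")

// sys.addShutdownHook registers a JVM shutdown hook and returns a
// ShutdownHookThread handle; the body runs on System.exit or normal
// JVM termination, regardless of how the REPL defines `exit`.
val hook = sys.addShutdownHook {
  stopContext()
}

// The handle can be cancelled if the context is stopped manually first.
val removed = hook.remove()
```

This sidesteps the interpreter-level interception problem entirely, since the hook fires no matter which code path terminates the JVM.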
[GitHub] spark pull request: [WIP] Loading spark-defaults.conf when creatin...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1233#issuecomment-47372715 @vanzin The situation is that the `sbin/start-*.sh` scripts do not support `spark-defaults.conf`. E.g. `sbin/start-history-server.sh` cannot load the `spark.history.fs.logDirectory` configuration from `spark-defaults.conf`.
[GitHub] spark pull request: [WIP] Loading spark-defaults.conf when creatin...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/1233#issuecomment-47373161 Ah, so it's SPARK-2098. I think it's a nice feature to have (I filed the bug after all), but we can't break the existing semantics. For daemons, the command line parsers could do that (by having a --properties-file argument similar to spark-submit). But if you want to support arbitrary SparkConf instances to read these conf files, it will become trickier, since now you need to propagate that command line information somehow.
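The `--properties-file` approach amounts to parsing a spark-defaults.conf-style file into key/value pairs before the daemon starts. A minimal sketch (the keys and content here are illustrative, not the actual daemon code); `java.util.Properties` handles both whitespace- and `=`-separated pairs:

```scala
import java.io.StringReader
import java.util.Properties
import scala.jdk.CollectionConverters._

// Parse spark-defaults.conf-style content into an immutable Map.
def loadDefaults(content: String): Map[String, String] = {
  val props = new Properties()
  props.load(new StringReader(content))
  props.asScala.toMap
}

val conf = loadDefaults(
  "spark.history.fs.logDirectory /tmp/spark-events\nspark.master local[2]")
```

In the real code, loading from a path given on the command line (rather than a string) is the same pattern with a `FileReader`.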
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47373233 Merged build triggered.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47373251 Merged build started.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47373412 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16208/
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47373409 Merged build finished.
[GitHub] spark pull request: [WIP] Loading spark-defaults.conf when creatin...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1233#issuecomment-47373587 You're right; the corresponding code should be submitted over the weekend.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47373781 Merged build triggered.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47373794 Merged build started.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47374032 Ok this is good to go now I think. Two changes: (1) As @andrewor14 suggested, I added the read method to the UI as shown in the image below ![image](https://cloud.githubusercontent.com/assets/1108612/3414962/8aa67a3c-fe1b-11e3-92e9-d21afa43be78.png) (2) I changed the DataReadMethod name from Hdfs to Hadoop, since @pwendell pointed out that data won't necessarily have come from Hdfs. @pwendell also recommended checking the class name of the Hadoop input split before trying to set the input metrics, to ensure that the type of split supports the getLength() method, because some split types (e.g., the HBase one) just return 0 when you call getLength(). I looked into this a little bit and there doesn't seem to be a good way to predict when an InputSplit subclass will return an accurate value for getLength() (@pwendell's original suggestion of checking whether the class name ends with FileSplit is too restrictive because CompositeInputSplit accurately returns the length). I think it's fine to leave this as-is, because if the InputSplit subclass returns 0 from getLength(), the total input size for the stage will be 0, so we won't show the input size in the UI. As a result, I don't think this will be confusing to users.
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47376027 test this please.
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47376498 Merged build started.
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47376478 Merged build triggered.
[GitHub] spark pull request: [SPARK-2003] Fix python SparkContext example
GitHub user mattf opened a pull request: https://github.com/apache/spark/pull/1246 [SPARK-2003] Fix python SparkContext example You can merge this pull request into a Git repository by running: $ git pull https://github.com/mattf/spark SPARK-2003 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1246.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1246 commit b12e7ca2609d4597f8cb6f14dc0610a563807b3e Author: Matthew Farrellee m...@redhat.com Date: 2014-06-27T17:20:45Z [SPARK-2003] Fix python SparkContext example
[GitHub] spark pull request: [SPARK-2003] Fix python SparkContext example
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1246#issuecomment-47377085 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47378607 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47378608 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16209/
[GitHub] spark pull request: Strip '@' symbols when merging pull requests.
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1239#issuecomment-47379454 Yesss, thank you, great idea
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/962#discussion_r14305330 --- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala --- @@ -87,6 +93,29 @@ private[spark] object TaskMetrics { def empty: TaskMetrics = new TaskMetrics } +/** + * :: DeveloperApi :: + * Method by which input data was read. Network means that the data was read over the network + * from a remote block manager (which may have stored the data on-disk or in-memory). + */ +@DeveloperApi +private[spark] object DataReadMethod extends Enumeration with Serializable { --- End diff -- If it's `private[spark]`, it doesn't need to be a developer API.
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/1195#discussion_r14305374 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala --- @@ -58,6 +66,11 @@ private[streaming] class BlockGenerator( @volatile private var currentBuffer = new ArrayBuffer[Any] @volatile private var stopped = false + private var currentBlockId: StreamBlockId = StreamBlockId(receiverId, +clock.currentTime() - blockInterval) + + // Removes might happen from the map while other threads are inserting. --- End diff -- If this is true then shouldn't it be a ConcurrentHashMap instead?
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/1195#discussion_r14305438 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala --- @@ -48,6 +49,13 @@ private[streaming] class BlockGenerator( private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any]) + /** + * Internal representation of a callback function and its argument. + * @param function - The callback function --- End diff -- nit: add empty line before first `@param` (here and in other doc comments).
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/962#discussion_r14305460 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala --- @@ -209,8 +224,11 @@ private[ui] class StagePage(parent: JobProgressTab) extends WebUIPage(stage) { } } - def taskRow(shuffleRead: Boolean, shuffleWrite: Boolean, bytesSpilled: Boolean) - (taskData: TaskUIData): Seq[Node] = { + def taskRow( +hasInput: Boolean, +hasShuffleRead: Boolean, +hasShuffleWrite: Boolean, +hasBytesSpilled: Boolean)(taskData: TaskUIData): Seq[Node] = { --- End diff -- nit: sorry you need to indent these by 2 more spaces
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47380451 Hi @kayousterhout, pending minor changes this LGTM.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/962#issuecomment-47381263 In response to your comment, actually if the bytesRead is 0, you still display `0 bytes (hadoop)`, because the code currently sets the `InputMetrics` no matter what. This is probably fine though.
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47381774 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16210/
[GitHub] spark pull request: [SPARK-2287] [SQL] Make ScalaReflection be abl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1226#issuecomment-47381773 Merged build finished.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/962#discussion_r14305981 --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala --- @@ -112,6 +113,15 @@ class NewHadoopRDD[K, V]( split.serializableHadoopSplit.value, hadoopAttemptContext) reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext) + val inputMetrics = new InputMetrics(DataReadMethod.Hadoop) + try { +inputMetrics.bytesRead = split.serializableHadoopSplit.value.getLength() + } catch { +case e: Exception => + logWarning("Unable to get input split size in order to set task input bytes", e) + } + context.taskMetrics.inputMetrics = Some(inputMetrics) --- End diff -- same
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/1195#discussion_r14306024 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala --- @@ -48,6 +49,13 @@ private[streaming] class BlockGenerator( private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any]) + /** + * Internal representation of a callback function and its argument. + * @param function - The callback function + * @param arg - Argument to pass to the function + */ + private class Callback(val function: Any => Unit, val arg: Any) --- End diff -- I don't know, this type feels weird to me. It feels like the closure itself should encapsulate any local data it needs, and any arguments here should only be the ones that the caller of the callback is passing. e.g.: * if BlockGenerator does not pass any arguments to the callback, the callback signature should be () => Unit * if it passes a String, the signature should be String => Unit In the call site, if the closure needs other data, that data can exist locally and doesn't need to be known by this code, something along the lines of: val somethingLocal = foo bm.store(i, () => { println(somethingLocal) })
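The closure-based alternative sketched in the comment above can be made concrete. `BlockStore`, `store`, and `pushAll` below are hypothetical stand-ins for the `BlockGenerator`/`BlockManager` store path (not Spark's real API); the point is that a plain `() => Unit` callback captures `somethingLocal` from its call site, so no `Callback(function, arg)` wrapper class is needed:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the store path: it records each callback and
// invokes them all once the "block" is pushed.
class BlockStore {
  private val callbacks = ArrayBuffer.empty[() => Unit]
  def store(item: Any, onPushed: () => Unit): Unit = callbacks += onPushed
  def pushAll(): Unit = callbacks.foreach(cb => cb())
}

val store = new BlockStore
val acked = ArrayBuffer.empty[String]
val somethingLocal = "block-1" // local data lives at the call site
// The closure captures somethingLocal; the callback type stays () => Unit.
store.store(42, () => { acked += somethingLocal })
store.pushAll()
```

This keeps the callback's type determined only by what the caller of the callback actually passes, as the comment suggests.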
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1056#issuecomment-47382782 Merged build triggered.
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1056#issuecomment-47382793 Merged build started.
[GitHub] spark pull request: [SPARK-1683] Track task read metrics.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/962#discussion_r14305949 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -196,6 +197,17 @@ class HadoopRDD[K, V]( context.addOnCompleteCallback{ () => closeIfNeeded() } val key: K = reader.createKey() val value: V = reader.createValue() + + // Set the task input metrics. + val inputMetrics = new InputMetrics(DataReadMethod.Hadoop) + try { +inputMetrics.bytesRead = split.inputSplit.value.getLength() + } catch { +case e: java.io.IOException => + logWarning("Unable to get input size to set InputMetrics for task", e) + } + context.taskMetrics.inputMetrics = Some(inputMetrics) --- End diff -- Actually, now that we display the read method on the UI, we should set this only if `bytesRead` exists (in the try block). Otherwise we end up with a bunch of `0 bytes (memory)`
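A minimal sketch of the change the review asks for. `FakeSplit`, the simplified `InputMetrics`, and `metricsFor` here are hypothetical stand-ins, not Spark's real classes; the structure shows the metrics object being published only when `getLength()` actually succeeds, so the UI never sees a spurious `0 bytes` entry:

```scala
// Stand-in for an input split whose size lookup may throw.
class FakeSplit(len: => Long) { def getLength(): Long = len }
// Simplified stand-in for Spark's InputMetrics.
class InputMetrics(val method: String) { var bytesRead: Long = 0L }

def metricsFor(split: FakeSplit): Option[InputMetrics] =
  try {
    val m = new InputMetrics("Hadoop")
    m.bytesRead = split.getLength()
    Some(m) // publish metrics only when the size was available
  } catch {
    case e: java.io.IOException => None // leave inputMetrics unset
  }

val ok = metricsFor(new FakeSplit(1024L))
val bad = metricsFor(new FakeSplit(throw new java.io.IOException("no size")))
```

Keeping the `Some(...)` assignment inside the try block is exactly the "set this only if `bytesRead` exists" suggestion.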
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user harishreedharan commented on a diff in the pull request: https://github.com/apache/spark/pull/1195#discussion_r14305990 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala --- @@ -58,6 +66,11 @@ private[streaming] class BlockGenerator( @volatile private var currentBuffer = new ArrayBuffer[Any] @volatile private var stopped = false + private var currentBlockId: StreamBlockId = StreamBlockId(receiverId, +clock.currentTime() - blockInterval) + + // Removes might happen from the map while other threads are inserting. --- End diff -- Not really. We have to protect against other threads having a reference to the ArrayBuffer corresponding to each block id. Specifically if a thread is in the store method and is adding values to the buffer, and another thread calls remove() on the same block id from the map - the buffer could still be changing while the 2nd thread is calling the callbacks. To prevent this, any operation on the buffer and removal from the map should be protected by the same lock. So any += calls to the buffer and any removes from the map should be synchronized. This ensures that there is no thread holding onto a reference of the buffer instance while the buffer is being removed from the map.
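The locking invariant described above can be sketched as follows. The `blocks` map, `store`, and `remove` are simplified stand-ins for `BlockGenerator`'s internals, not its actual fields; what matters is that appends to a block's buffer and removal of that buffer from the map share one lock, so no thread can still be appending after the hand-off:

```scala
import scala.collection.mutable

val lock = new Object
val blocks = mutable.HashMap.empty[Long, mutable.ArrayBuffer[Any]]

// All += calls to a block's buffer happen under the lock.
def store(blockId: Long, item: Any): Unit = lock.synchronized {
  blocks.getOrElseUpdate(blockId, mutable.ArrayBuffer.empty[Any]) += item
}

// Removal under the same lock: once this returns, no store() call can
// still be mutating the returned buffer, so callbacks may safely read it.
def remove(blockId: Long): Option[mutable.ArrayBuffer[Any]] =
  lock.synchronized { blocks.remove(blockId) }

store(1L, "a")
store(1L, "b")
val taken = remove(1L)
```

This also shows why a `ConcurrentHashMap` alone would not be enough: it would make the map operations atomic individually, but not the buffer mutation and the removal together.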
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/1195#discussion_r14308000 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala --- @@ -58,6 +66,11 @@ private[streaming] class BlockGenerator( @volatile private var currentBuffer = new ArrayBuffer[Any] @volatile private var stopped = false + private var currentBlockId: StreamBlockId = StreamBlockId(receiverId, +clock.currentTime() - blockInterval) + + // Removes might happen from the map while other threads are inserting. --- End diff -- Ah, I missed the synchronized in the `store()` method.
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/1195#discussion_r14308113 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala --- @@ -58,6 +66,11 @@ private[streaming] class BlockGenerator( @volatile private var currentBuffer = new ArrayBuffer[Any] @volatile private var stopped = false + private var currentBlockId: StreamBlockId = StreamBlockId(receiverId, +clock.currentTime() - blockInterval) + + // Removes might happen from the map while other threads are inserting. --- End diff -- BTW, given your explanation, the comment itself seems a little out of place, since it doesn't really explain much.