Re: [VOTE] Release Apache Spark 0.9.1 (RC3)
In https://github.com/apache/spark/blob/master/docs/quick-start.md, line 127, one spelling mistake was found; please correct it (proogram to program). On Fri, Mar 28, 2014 at 9:58 PM, Will Benton wi...@redhat.com wrote: RC3 works with the applications I'm working on now and MLLib performance is indeed perceptibly improved over 0.9.0 (although I haven't done a real evaluation). Also, from the downstream perspective, I've been tracking the 0.9.1 RCs in Fedora and have no issues to report there either: http://koji.fedoraproject.org/koji/buildinfo?buildID=507284 [x] +1 Release this package as Apache Spark 0.9.1 [ ] -1 Do not release this package because ...
a weird test case in Streaming
Hi, all, The "recovery with file input stream" test in Streaming.CheckpointSuite sometimes fails even when you are working on a totally irrelevant part; I have met this problem 3+ times. I assume this test case is likely to fail when the testing servers are very busy? Two cases from others: Sean: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/ Mark: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13531/ Best, -- Nan Zhu
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954345#comment-13954345 ] ASF GitHub Bot commented on SPARK-732: -- Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39002053 Merged build finished. Build is starting -or- tests failed to complete. Recomputation of RDDs may result in duplicated accumulator updates -- Key: SPARK-732 URL: https://issues.apache.org/jira/browse/SPARK-732 Project: Apache Spark Issue Type: Bug Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.9.0, 0.8.2 Reporter: Josh Rosen Assignee: Nan Zhu Fix For: 1.0.0 Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example:
{code}
val acc = sc.accumulator(0)
data.map(x => acc += 1; f(x))
data.count()
// acc should equal data.count() here
data.foreach{...}
// Now, acc = 2 * data.count() because the map() was recomputed.
{code}
I think that this behavior is incorrect, especially because this behavior allows the addition or removal of a cache() call to affect the outcome of a computation. There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494]. I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates. Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this. Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates). -- This message was sent by Atlassian JIRA (v6.2#6252)
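As a concrete illustration of the dedup idea sketched in the description, here is a minimal, self-contained example; DedupAccumulator and its interface are hypothetical (they are not Spark's Accumulator API), but they show the intended behavior of applying an update at most once per (stageId, partitionId):
{code}
import scala.collection.mutable

class DedupAccumulator(initial: Long = 0L) {
  private var current = initial
  // (stageId, partitionId) pairs whose updates have already been applied
  private val seen = mutable.Set.empty[(Int, Int)]

  def add(stageId: Int, partitionId: Int, delta: Long): Unit = {
    // mutable.Set#add returns false if the element was already present,
    // so a recomputed partition cannot double-count.
    if (seen.add((stageId, partitionId))) {
      current += delta
    }
  }

  def value: Long = current
}

object DedupAccumulatorExample extends App {
  val acc = new DedupAccumulator()
  acc.add(stageId = 0, partitionId = 0, delta = 5)
  acc.add(stageId = 0, partitionId = 0, delta = 5) // recomputation of the same partition: ignored
  acc.add(stageId = 0, partitionId = 1, delta = 3)
  assert(acc.value == 8L)
}
{code}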
Re: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
Guys, I fixed this by adding j...@apache.org to the mailing list, no more moderation required. Cheers, Chris -Original Message- From: ASF GitHub Bot (JIRA) j...@apache.org Reply-To: dev@spark.apache.org dev@spark.apache.org Date: Saturday, March 29, 2014 10:14 AM To: dev@spark.apache.org dev@spark.apache.org Subject: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates [ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954345#comment-13954345 ] ASF GitHub Bot commented on SPARK-732: -- Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39002053 Merged build finished. Build is starting -or- tests failed to complete.
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954353#comment-13954353 ] ASF GitHub Bot commented on SPARK-732: -- Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39002548 Merged build triggered. Build is starting -or- tests failed to complete.
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954354#comment-13954354 ] ASF GitHub Bot commented on SPARK-732: -- Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39002551 Merged build started. Build is starting -or- tests failed to complete.
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954363#comment-13954363 ] ASF GitHub Bot commented on SPARK-732: -- Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/228#discussion_r11094329
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -789,6 +799,25 @@ class DAGScheduler(
   }

   /**
+   * detect the duplicate accumulator value and save the accumulator values
+   * @param accumValue the accumulator values received from the task
+   * @param stage the stage which the task belongs to
+   * @param task the completed task
+   */
+  private def saveAccumulatorValue(accumValue: Map[Long, Any], stage: Stage, task: Task[_]) {
+    if (accumValue != null &&
+      (!stageIdToAccumulators.contains(stage.id) ||
+        !stageIdToAccumulators(stage.id).contains(task.partitionId))) {
+      stageIdToAccumulators.getOrElseUpdate(stage.id,
+        new HashMap[Int, ListBuffer[(Long, Any)]]).
+        getOrElseUpdate(task.partitionId, new ListBuffer[(Long, Any)])
+      for ((id, value) <- accumValue) {
+        stageIdToAccumulators(stage.id)(task.partitionId) += id -> value
--- End diff --
nit: you can avoid the lookup within the loop by using the value returned in the getOrElseUpdate calls.
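For illustration, here is roughly what that suggestion could look like applied to the method in the diff above. This is only a sketch that assumes the same surrounding DAGScheduler state as the PR (stageIdToAccumulators, Stage, Task); it is not the code that was eventually merged:
{code}
import scala.collection.mutable.{HashMap, ListBuffer}

// Sketch only: keep the buffer returned by getOrElseUpdate and append to it
// directly, instead of looking up stageIdToAccumulators(stage.id)(task.partitionId)
// on every iteration of the loop.
private def saveAccumulatorValue(accumValue: Map[Long, Any], stage: Stage, task: Task[_]) {
  if (accumValue != null &&
      (!stageIdToAccumulators.contains(stage.id) ||
       !stageIdToAccumulators(stage.id).contains(task.partitionId))) {
    val buffer = stageIdToAccumulators
      .getOrElseUpdate(stage.id, new HashMap[Int, ListBuffer[(Long, Any)]])
      .getOrElseUpdate(task.partitionId, new ListBuffer[(Long, Any)])
    for ((id, value) <- accumValue) {
      buffer += id -> value
    }
  }
}
{code}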
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954368#comment-13954368 ] ASF GitHub Bot commented on SPARK-732: -- Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39003471 looks good @CodingCat ! just made a few minor points. excellent change !!
JIRA. github and asf updates
Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which the others might miss, it would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Re: JIRA. github and asf updates
Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which others might miss, would be great if we could minimize the volume of duplicated communication. Regards, Mridul
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954379#comment-13954379 ] ASF GitHub Bot commented on SPARK-732: -- Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39004143 Build is starting -or- tests failed to complete. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13579/
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954384#comment-13954384 ] ASF GitHub Bot commented on SPARK-732: -- Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39004325 Can we just add the accumulator update to TaskSetManager, in the handleSuccessfulTask() method? This seems much simpler because the TaskSetManager already has all of the state about which tasks are running, which ones have been resubmitted or speculated, etc. I think this change would be much simpler. Over time, a lot of functionality has leaked into the DAGScheduler, such that there's a lot of state that's kept in multiple places: in the DAGScheduler and in the TaskSetManager or the TaskSchedulerImpl. The abstraction is supposed to be that the DAGScheduler handles the high level semantics of scheduling stages and dealing with inter-stage dependencies, and the TaskSetManager handles the low-level details of the tasks for each stage. There are some parts of this abstraction that are currently broken (where the DAGScheduler knows too much about task-level details) and refactoring this is on my todo list, but in the meantime I think we should try not to make this problem any worse, because it makes the code much more complicated, more difficult to understand, and buggy.
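To make the suggestion concrete, here is a minimal, self-contained sketch of deduplicating accumulator updates at the task-set level. The class and method below are simplified stand-ins, not Spark's actual TaskSetManager or handleSuccessfulTask signatures:
{code}
import scala.collection.mutable

// Simplified stand-in for a task-set manager that applies accumulator
// updates at most once per partition, even if a speculative or resubmitted
// copy of the task also reports success.
class SimpleTaskSetManager {
  private val partitionsWithAppliedUpdates = mutable.Set.empty[Int]

  def handleSuccessfulTask(
      partitionId: Int,
      accumUpdates: Map[Long, Long],
      applyUpdate: (Long, Long) => Unit): Unit = {
    // Set#add returns false if this partition already reported success,
    // so duplicate updates are silently dropped.
    if (partitionsWithAppliedUpdates.add(partitionId)) {
      accumUpdates.foreach { case (accumId, delta) => applyUpdate(accumId, delta) }
    }
    // other bookkeeping (marking the task finished, etc.) would go here
  }
}
{code}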
Could you undo the JIRA dev list e-mails?
Hey Chris, I don't think our JIRA has been fully migrated to Apache infra, so it's really confusing to send people e-mails referring to the new JIRA since we haven't announced it yet. There is some content there because we've been trying to do the migration, but I'm not sure it's entirely finished. Also, right now our github comments go to a commits@ list. I'm actually -1 on copying all of these to JIRA because we do a bunch of review-level comments that are going to pollute the JIRA a bunch. In any case, can you revert the change, whatever it was, that sent these to the dev list? We should have a coordinated plan about this transition and the e-mail changes we plan to make. - Patrick
Re: JIRA. github and asf updates
With the speed of comment updates in JIRA by the Spark dev community, +1 for the issues@ list - Henry On Saturday, March 29, 2014, Patrick Wendell pwend...@gmail.com wrote: Ah sorry I see - Jira updates are going to the dev list. Maybe that's not desirable. I think we should send them to the issues@ list. On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote: Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which others might miss, would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Re: Could you undo the JIRA dev list e-mails?
Okay, I think I managed to revert this by just removing jira@a.o from our dev list. On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Chris, I don't think our JIRA has been fully migrated to Apache infra, so it's really confusing to send people e-mails referring to the new JIRA since we haven't announced it yet. There is some content there because we've been trying to do the migration, but I'm not sure it's entirely finished. Also, right now our github comments go to a commits@ list. I'm actually -1 copying all of these to JIRA because we do a bunch of review level comments that are going to pollute the JIRA a bunch. In any case, can you revert the change whatever it was that sent these to the dev list? We should have a coordinated plan about this transition and the e-mail changes we plan to make. - Patrick
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954423#comment-13954423 ] ASF GitHub Bot commented on SPARK-732: -- Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/228#issuecomment-39007448 Hi, Kay, I will think about it, and see if we can move accumulator related functionalities to tm entirely.
Re: JIRA. github and asf updates
If the PR comments are going to be replicated into the JIRAs and they are going to be sent to dev@, then we could keep that and remove the [Github] updates? The last was added since discussions were happening off apache lists - which should be handled by the jira updates? I don't mind the mails if they had content - this is just duplication of the same message in three mails :-) Btw, this is a good problem to have - a vibrant and very actively engaged community generating a lot of meaningful traffic! I just don't want to get distracted from it by repetitions. Regards, Mridul On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote: Ah sorry I see - Jira updates are going to the dev list. Maybe that's not desirable. I think we should send them to the issues@ list. On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote: Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which others might miss, would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Re: JIRA. github and asf updates
I'm working with infra to get the following set up: 1. Don't post github updates to jira comments (they are too low level). If users want these they can subscribe to commits@s.a.o. 2. The Jira comment stream will go to issues@s.a.o so people can opt into that. One thing YARN has set up is to e-mail *new* JIRAs to the dev list. That might be cool to set up in the future. On Sat, Mar 29, 2014 at 1:15 PM, Mridul Muralidharan mri...@gmail.com wrote: If the PR comments are going to be replicated into the jira's and they are going to be set to dev@, then we could keep that and remove [Github] updates ? The last was added since discussions were happening off apache lists - which should be handled by the jira updates ? I dont mind the mails if they had content - this is just duplication of the same message in three mails :-) Btw, this is a good problem to have - a vibrant and very actively engaged community generated a lot of meaningful traffic ! I just dont want to get distracted from it by repetitions. Regards, Mridul On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote: Ah sorry I see - Jira updates are going to the dev list. Maybe that's not desirable. I think we should send them to the issues@ list. On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote: Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which others might miss, would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Re: Could you undo the JIRA dev list e-mails?
No worries, thanks Patrick, agreed. -Original Message- From: Patrick Wendell pwend...@gmail.com Date: Saturday, March 29, 2014 1:47 PM To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov Cc: Chris Mattmann mattm...@apache.org, dev@spark.apache.org dev@spark.apache.org Subject: Re: Could you undo the JIRA dev list e-mails? Okay cool - sorry about that. Infra should be able to migrate these over to an issues@ list shortly. I'd rather bother a few moderators than the entire dev list... but ya I realize it's annoying :/ On Sat, Mar 29, 2014 at 1:22 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: I reverted this Patrick, per your request: [hermes] 8:21pm spark.apache.org ezmlm-list dev | grep jira j...@apache.org [hermes] 8:21pm spark.apache.org ezmlm-unsub dev j...@apache.org [hermes] 8:21pm spark.apache.org ezmlm-list dev | grep jira [hermes] 8:21pm spark.apache.org Note that I and other moderators will now receive moderation emails until the infra ticket is fixed, but others will not. I'll set up a mail filter. Chris -Original Message- From: Mattmann, Chris Mattmann chris.a.mattm...@jpl.nasa.gov Date: Saturday, March 29, 2014 1:11 PM To: Patrick Wendell pwend...@gmail.com, Chris Mattmann mattm...@apache.org Cc: dev@spark.apache.org dev@spark.apache.org Subject: Re: Could you undo the JIRA dev list e-mails? Patrick, No problem -- at the same time realize that I and the other moderators were getting spammed by moderation emails from JIRA, so you should take that into consideration as well. Cheers, Chris -Original Message- From: Patrick Wendell pwend...@gmail.com Date: Saturday, March 29, 2014 11:59 AM To: Chris Mattmann mattm...@apache.org Cc: d...@spark.incubator.apache.org d...@spark.incubator.apache.org Subject: Re: Could you undo the JIRA dev list e-mails? Okay I think I managed to revert this by just removing jira@a.o from our dev list. On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Chris, I don't think our JIRA has been fully migrated to Apache infra, so it's really confusing to send people e-mails referring to the new JIRA since we haven't announced it yet. There is some content there because we've been trying to do the migration, but I'm not sure it's entirely finished. Also, right now our github comments go to a commits@ list. I'm actually -1 copying all of these to JIRA because we do a bunch of review level comments that are going to pollute the JIRA a bunch. In any case, can you revert the change whatever it was that sent these to the dev list? We should have a coordinated plan about this transition and the e-mail changes we plan to make. - Patrick
Re: Travis CI
Hi, Is the migration from Jenkins to Travis finished? I think Travis is actually not stable based on my observations these days (and Jenkins has become unstable too…… :-( ). I'm actively working on two PRs related to DAGScheduler. Problems I saw on Travis: 1. the test “large number of iterations” in BagelSuite sometimes failed, because it doesn't output anything within 10 seconds 2. hive/test usually aborted because it doesn't output anything within 10 minutes 3. a test case in Streaming.CheckpointSuite failed 4. hive/test didn't finish in 50 minutes, and was aborted Problems on Jenkins: 1. the build didn't finish in 90 mins, and the process was aborted 2. the same as problem 3 on Travis Some of these problems once appeared in Jenkins months ago, but not so often. I'm not complaining; I know that the admins are working hard to keep the community running in good condition on every aspect. I'm just reporting what I saw and hope that it can help you to identify the problems. Thank you -- Nan Zhu On Tuesday, March 25, 2014 at 10:11 PM, Patrick Wendell wrote: Ya It's been a little bit slow lately because of a high error rate in interactions with the git-hub API. Unfortunately we are pretty slammed for the release and haven't had a ton of time to do further debugging. - Patrick On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu zhunanmcg...@gmail.com wrote: I just found that the Jenkins is not working from this afternoon for one PR, the first time the build failed after 90 minutes, the second time it has run for more than 2 hours and no result is returned Best, -- Nan Zhu On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote: That's not correct - like Michael said the Jenkins build remains the reference build for now. On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com wrote: I assume the Jenkins is not working now? Best, -- Nan Zhu On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote: Just a quick note to everyone that Patrick and I are playing around with Travis CI on the Spark github repository. For now, travis does not run all of the test cases, so it will only be turned on experimentally. Long term it looks like Travis might give better integration with github, so we are going to see if it is feasible to get all of our tests running on it. *Jenkins remains the reference CI and should be consulted before merging pull requests, independent of what Travis says.* If you have any questions or want to help out with the investigation, let me know! Michael
Re: Travis CI
Is the migration from Jenkins to Travis finished? It is not finished, and really at this point it is only something we are considering, not something that will happen for sure. We turned it on in addition to Jenkins so that we could start finding issues exactly like the ones you described below, to determine if Travis is going to be a viable option. Basically it seems to me that the Travis environment is a little less predictable (probably because of virtualization), and this is pointing out some existing flakiness in the tests. If there are tests that are regularly flaky we should probably file JIRAs so they can be fixed or switched off. If you have seen a test fail 2-3 times and then pass with no changes, I'd say go ahead and file an issue for it (others should feel free to chime in if we want some other process here). A few more specific comments inline below. 2. hive/test usually aborted because it doesn't output anything within 10 minutes Hmm, this is a little confusing. Do you have a pointer to this one? Was there any other error? 4. hive/test didn't finish in 50 minutes, and was aborted Here I think the right thing to do is probably to break the hive tests in two and run them in parallel. There is already machinery for doing this, we just need to flip the options on in the travis.yml to make it happen. This is only going to get more critical as we whitelist more hive tests. We also talked about checking the PR and skipping the hive tests when there have been no changes in catalyst/sql/hive. I'm okay with this plan, we just need to find someone with time to implement it.
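As a rough sketch of the "skip the hive tests when nothing under catalyst/sql/hive changed" idea, a PR check could look something like the following; the path prefix and the diff range are assumptions for illustration, not the actual CI configuration:
{code}
import scala.sys.process._

// Decide whether the hive tests need to run for this PR, based on which
// files changed relative to master (the merge base in a PR builder).
object ShouldRunHiveTests extends App {
  val changedFiles = Seq("git", "diff", "--name-only", "master...HEAD").!!.split("\n").toSeq
  // assumption: catalyst, sql core, and hive all live under sql/ in the repo
  val runHiveTests = changedFiles.exists(path => path.startsWith("sql/"))
  println(s"Run hive tests: $runHiveTests")
}
{code}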
Re: Migration to the new Spark JIRA
That’s great! Andy, thank you for all your contributions to the community! Best, -- Nan Zhu On Saturday, March 29, 2014 at 11:40 PM, Patrick Wendell wrote: Hey All, We've successfully migrated the Spark JIRA to the Apache infrastructure. This turned out to be a huge effort, led by Andy Konwinski, who deserves all of our deepest appreciation for managing this complex migration. Since Apache runs the same JIRA version as Spark's existing JIRA, there is no new software to learn. A few things to note though: - The issue tracker for Spark is now at: https://issues.apache.org/jira/browse/SPARK - You can sign up to receive an e-mail feed of JIRA updates by e-mailing: issues-subscr...@spark.apache.org - DO NOT create issues on the old JIRA. I'll try to disable this so that it is read-only. - You'll need to create an account at the new site if you don't have one already. - We've imported all the old JIRAs. In some cases the import tool can't correctly guess the assignee for the JIRA, so we may have to do some manual assignment. - If you feel like you don't have sufficient permissions on the new JIRA, please send me an e-mail. I tried to add all of the committers as administrators but I may have missed some. Thanks, Patrick
Re: [VOTE] Release Apache Spark 0.9.1 (RC3)
Ah yes, that should be and will be updated! One more update in the docs: on the home page of Spark Streaming http://spark.incubator.apache.org/streaming/, under Deployment Options it is mentioned that *Spark Streaming can read data from HDFS http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html, Flume http://flume.apache.org/, Kafka http://kafka.apache.org/, Twitter https://dev.twitter.com/ and ZeroMQ http://zeromq.org/*. But from Spark Streaming 0.9.0 onwards it also supports MQTT. Can you please do the necessary update after the voting has completed? On Sat, Mar 29, 2014 at 9:28 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Small fixes to the docs can be done after the voting has completed. This should not determine the vote on the release candidate binaries. Please vote as +1 if the published artifacts and binaries are good to go. TD On Mar 29, 2014 5:23 AM, prabeesh k prabsma...@gmail.com wrote: https://github.com/apache/spark/blob/master/docs/quick-start.md in line 127. one spelling mistake found please correct it. (proogram to program) On Fri, Mar 28, 2014 at 9:58 PM, Will Benton wi...@redhat.com wrote: RC3 works with the applications I'm working on now and MLLib performance is indeed perceptibly improved over 0.9.0 (although I haven't done a real evaluation). Also, from the downstream perspective, I've been tracking the 0.9.1 RCs in Fedora and have no issues to report there either: http://koji.fedoraproject.org/koji/buildinfo?buildID=507284 [x] +1 Release this package as Apache Spark 0.9.1 [ ] -1 Do not release this package because ...