Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-29 Thread prabeesh k
In https://github.com/apache/spark/blob/master/docs/quick-start.md, line 127,
there is a spelling mistake; please correct it ("proogram" to "program").



On Fri, Mar 28, 2014 at 9:58 PM, Will Benton wi...@redhat.com wrote:

 RC3 works with the applications I'm working on now and MLLib performance
 is indeed perceptibly improved over 0.9.0 (although I haven't done a real
 evaluation).  Also, from the downstream perspective, I've been tracking the
 0.9.1 RCs in Fedora and have no issues to report there either:

http://koji.fedoraproject.org/koji/buildinfo?buildID=507284

 [x] +1 Release this package as Apache Spark 0.9.1
 [ ] -1 Do not release this package because ...



a weird test case in Streaming

2014-03-29 Thread Nan Zhu
Hi, all  

The "recovery with file input stream" test in Streaming.CheckpointSuite
sometimes fails even when you are working on a totally unrelated part; I have
run into this problem 3+ times.

I assume this test case is more likely to fail when the test servers are very
busy?

Two cases from others:

Sean: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/

Mark: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13531/


Best,  

--  
Nan Zhu




[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954345#comment-13954345
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39002053
  
Merged build finished. Build is starting -or- tests failed to complete.


 Recomputation of RDDs may result in duplicated accumulator updates
 --

 Key: SPARK-732
 URL: https://issues.apache.org/jira/browse/SPARK-732
 Project: Apache Spark
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.9.0, 
 0.8.2
Reporter: Josh Rosen
Assignee: Nan Zhu
 Fix For: 1.0.0


 Currently, Spark doesn't guard against duplicated updates to the same 
 accumulator due to recomputations of an RDD.  For example:
 {code}
 val acc = sc.accumulator(0)
 data.map(x => acc += 1; f(x))
 data.count()
 // acc should equal data.count() here
 data.foreach{...}
 // Now, acc = 2 * data.count() because the map() was recomputed.
 {code}
 I think that this behavior is incorrect, especially because this behavior 
 allows the addition or removal of a cache() call to affect the outcome of a 
 computation.
 There's an old TODO to fix this duplicate update issue in the [DAGScheduler 
 code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
 I haven't tested whether recomputation due to blocks being dropped from the 
 cache can trigger duplicate accumulator updates.
 Hypothetically someone could be relying on the current behavior to implement 
 performance counters that track the actual number of computations performed 
 (including recomputations).  To be safe, we could add an explicit warning in 
 the release notes that documents the change in behavior when we fix this.
 Ignoring duplicate updates shouldn't be too hard, but there are a few 
 subtleties.  Currently, we allow accumulators to be used in multiple 
 transformations, so we'd need to detect duplicate updates at the 
 per-transformation level.  I haven't dug too deeply into the scheduler 
 internals, but we might also run into problems where pipelining causes what 
 is logically one set of accumulator updates to show up in two different tasks 
 (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause 
 what's logically the same accumulator update to be applied from two different 
 contexts, complicating the detection of duplicate updates).
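For concreteness, a minimal sketch of the behavior described above, assuming a
local SparkContext named sc and the current (pre-fix) semantics in which a
recomputed map() re-applies its accumulator updates:
{code}
// Hedged illustration only; sc is assumed to be an existing SparkContext.
val acc = sc.accumulator(0)
val data = sc.parallelize(1 to 100)
val mapped = data.map { x => acc += 1; x * 2 }   // bumps acc once per computed element

mapped.count()            // first computation: acc.value == 100
mapped.foreach(_ => ())   // without cache(), map() is recomputed: acc.value == 200

// Calling mapped.cache() before the first action would keep acc.value at 100,
// which is why adding or removing cache() can change the observed result.
{code}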



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread Mattmann, Chris A (3980)
Guys, I fixed this by adding j...@apache.org to the mailing list; no more
moderation is required.

Cheers,
Chris





-Original Message-
From: ASF GitHub Bot   (JIRA) j...@apache.org
Reply-To: dev@spark.apache.org dev@spark.apache.org
Date: Saturday, March 29, 2014 10:14 AM
To: dev@spark.apache.org dev@spark.apache.org
Subject: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result
in duplicated accumulator updates





[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954353#comment-13954353
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39002548
  
 Merged build triggered. Build is starting -or- tests failed to complete.




[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954354#comment-13954354
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39002551
  
Merged build started. Build is starting -or- tests failed to complete.




[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954363#comment-13954363
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/228#discussion_r11094329
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -789,6 +799,25 @@ class DAGScheduler(
   }
 
   /**
+   * detect the duplicate accumulator value and save the accumulator values
+   * @param accumValue the accumulator values received from the task
+   * @param stage the stage which the task belongs to
+   * @param task the completed task
+   */
+  private def saveAccumulatorValue(accumValue: Map[Long, Any], stage: Stage, task: Task[_]) {
+    if (accumValue != null &&
+        (!stageIdToAccumulators.contains(stage.id) ||
+         !stageIdToAccumulators(stage.id).contains(task.partitionId))) {
+      stageIdToAccumulators.getOrElseUpdate(stage.id,
+        new HashMap[Int, ListBuffer[(Long, Any)]]).
+        getOrElseUpdate(task.partitionId, new ListBuffer[(Long, Any)])
+      for ((id, value) <- accumValue) {
+        stageIdToAccumulators(stage.id)(task.partitionId) += id -> value
--- End diff --

nit: you can avoid the lookup within the loop by using the value returned 
in the getOrElseUpdate calls.
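
For reference, a sketch of that suggestion against the diff above, keeping the
buffer returned by the getOrElseUpdate calls so the loop no longer looks up the
nested maps on every iteration (a sketch only, not the final code):
{code}
private def saveAccumulatorValue(accumValue: Map[Long, Any], stage: Stage, task: Task[_]) {
  if (accumValue != null &&
      (!stageIdToAccumulators.contains(stage.id) ||
       !stageIdToAccumulators(stage.id).contains(task.partitionId))) {
    // Keep the ListBuffer returned by getOrElseUpdate and append to it directly,
    // instead of re-resolving stageIdToAccumulators(stage.id)(task.partitionId)
    // inside the loop.
    val buffer = stageIdToAccumulators
      .getOrElseUpdate(stage.id, new HashMap[Int, ListBuffer[(Long, Any)]])
      .getOrElseUpdate(task.partitionId, new ListBuffer[(Long, Any)])
    for ((id, value) <- accumValue) {
      buffer += id -> value
    }
  }
}
{code}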




[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954368#comment-13954368
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39003471
  
looks good @CodingCat ! just made a few minor points.
excellent change !!




JIRA. github and asf updates

2014-03-29 Thread Mridul Muralidharan
Hi,

  So we are now receiving updates from three sources for each change to a PR.
While each of them handles a corner case the others might miss, it would be
great if we could minimize the volume of duplicated communication.


Regards,
Mridul


Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
Mridul,

You can unsubscribe yourself from any of these sources, right?

- Patrick


On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote:

 Hi,

   So we are now receiving updates from three sources for each change to
 the PR.
 While each of them handles a corner case which others might miss,
 would be great if we could minimize the volume of duplicated
 communication.


 Regards,
 Mridul



[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954379#comment-13954379
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39004143
  
Build is starting -or- tests failed to complete.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13579/




[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954384#comment-13954384
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39004325
  
Can we just add the accumulator update to TaskSetManager, in the 
handleSuccessfulTask() method?  This seems much simpler because the 
TaskSetManager already has all of the state about which tasks are running, 
which ones have been resubmitted or speculated, etc.  I think this change would 
be much simpler.

Over time, a lot of functionality has leaked into the DAGScheduler, such 
that there's a lot of state that's kept in multiple places: in the DAGScheduler 
and in the TaskSetManager or the TaskSchedulerImpl.  The abstraction is 
supposed to be that the DAGScheduler handles the high level semantics of 
scheduling stages and dealing with inter-stage dependencies, and the 
TaskSetManager handles the low-level details of the tasks for each stage.  
There are some parts of this abstraction that are currently broken (where the 
DAGScheduler knows too much about task-level details) and refactoring this is 
on my todo list, but in the meantime I think we should try not to make this 
problem any worse, because it makes the code much more complicated, more 
difficult to understand, and buggy.
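
To make the idea concrete, here is a heavily simplified, self-contained sketch
(a toy stand-in, not Spark's actual TaskSetManager API): accumulator updates
are applied only on the first successful completion of each partition, so
resubmitted or speculated copies of a task cannot double-count.
{code}
import scala.collection.mutable

// Toy stand-in for the component that already tracks per-task state.
class ToyTaskSetManager {
  private val applied = mutable.Set[Int]()                        // partitions already counted
  private val accums  = mutable.Map[Long, Long]().withDefaultValue(0L)

  def handleSuccessfulTask(partitionId: Int, updates: Map[Long, Long]): Unit = {
    if (applied.add(partitionId)) {                               // first completion wins
      for ((id, delta) <- updates) accums(id) += delta
    }                                                             // later duplicates are dropped
  }

  def value(id: Long): Long = accums(id)
}

object AccumDedupDemo extends App {
  val tsm = new ToyTaskSetManager
  tsm.handleSuccessfulTask(0, Map(1L -> 10L))
  tsm.handleSuccessfulTask(0, Map(1L -> 10L))   // speculative duplicate, ignored
  tsm.handleSuccessfulTask(1, Map(1L -> 5L))
  println(tsm.value(1L))                        // 15, not 25
}
{code}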




Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Hey Chris,

I don't think our JIRA has been fully migrated to Apache infra, so it's
really confusing to send people e-mails referring to the new JIRA since we
haven't announced it yet. There is some content there because we've been
trying to do the migration, but I'm not sure it's entirely finished.

Also, right now our github comments go to a commits@ list. I'm actually -1
copying all of these to JIRA because we do a bunch of review level comments
that are going to pollute the JIRA a bunch.

In any case, can you revert the change whatever it was that sent these to
the dev list? We should have a coordinated plan about this transition and
the e-mail changes we plan to make.

- Patrick


Re: JIRA. github and asf updates

2014-03-29 Thread Henry Saputra
Given the speed of comment updates in JIRA by the Spark dev community, +1 for
the issues@ list.

- Henry

On Saturday, March 29, 2014, Patrick Wendell pwend...@gmail.com wrote:

 Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
 desirable. I think we should send them to the issues@ list.


 On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote:

  Mridul,
 
  You can unsubscribe yourself from any of these sources, right?
 
  - Patrick
 
 
  On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote:
 
  Hi,
 
So we are now receiving updates from three sources for each change to
  the PR.
  While each of them handles a corner case which others might miss,
  would be great if we could minimize the volume of duplicated
  communication.
 
 
  Regards,
  Mridul
 
 
 



Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Okay I think I managed to revert this by just removing jira@a.o from our
dev list.


On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Chris,

 I don't think our JIRA has been fully migrated to Apache infra, so it's
 really confusing to send people e-mails referring to the new JIRA since we
 haven't announced it yet. There is some content there because we've been
 trying to do the migration, but I'm not sure it's entirely finished.

 Also, right now our github comments go to a commits@ list. I'm actually
 -1 copying all of these to JIRA because we do a bunch of review level
 comments that are going to pollute the JIRA a bunch.

 In any case, can you revert the change whatever it was that sent these to
 the dev list? We should have a coordinated plan about this transition and
 the e-mail changes we plan to make.

 - Patrick



[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954423#comment-13954423
 ] 

ASF GitHub Bot commented on SPARK-732:
--

Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/228#issuecomment-39007448
  
Hi Kay, I will think about it and see if we can move the accumulator-related
functionality to the TaskSetManager entirely.




Re: JIRA. github and asf updates

2014-03-29 Thread Mridul Muralidharan
If the PR comments are going to be replicated into the JIRAs, and those are
going to be sent to dev@, then we could keep that and remove the [GitHub]
updates?
The latter was added because discussions were happening off the Apache lists,
which should now be handled by the JIRA updates.

I don't mind the mails if they have content; this is just duplication of the
same message in three mails :-)
Btw, this is a good problem to have: a vibrant and very actively engaged
community generating a lot of meaningful traffic!
I just don't want to get distracted from it by repetition.

Regards,
Mridul


On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote:
 Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
 desirable. I think we should send them to the issues@ list.


 On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote:

 Mridul,

 You can unsubscribe yourself from any of these sources, right?

 - Patrick


 On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote:

 Hi,

   So we are now receiving updates from three sources for each change to
 the PR.
 While each of them handles a corner case which others might miss,
 would be great if we could minimize the volume of duplicated
 communication.


 Regards,
 Mridul





Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
I'm working with infra to get the following set-up:

1. Don't post github updates to jira comments (they are too low level). If
users want these they can subscribe to commits@s.a.o.
2. Jira comment stream will go to issues@s.a.o so people can opt into that.

One thing YARN has set up that might be cool is to e-mail *new* JIRAs to the
dev list; we could look into setting that up in the future.


On Sat, Mar 29, 2014 at 1:15 PM, Mridul Muralidharan mri...@gmail.com wrote:

 If the PR comments are going to be replicated into the jira's and they
 are going to be set to dev@, then we could keep that and remove
 [Github] updates ?
 The last was added since discussions were happening off apache lists -
 which should be handled by the jira updates ?

 I dont mind the mails if they had content - this is just duplication
 of the same message in three mails :-)
 Btw, this is a good problem to have - a vibrant and very actively
 engaged community generated a lot of meaningful traffic !
 I just dont want to get distracted from it by repetitions.

 Regards,
 Mridul


 On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
  desirable. I think we should send them to the issues@ list.
 
 
  On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Mridul,
 
  You can unsubscribe yourself from any of these sources, right?
 
  - Patrick
 
 
  On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  Hi,
 
So we are now receiving updates from three sources for each change to
  the PR.
  While each of them handles a corner case which others might miss,
  would be great if we could minimize the volume of duplicated
  communication.
 
 
  Regards,
  Mridul
 
 
 



Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Mattmann, Chris A (3980)
No worries, thanks Patrick, agreed.


-Original Message-
From: Patrick Wendell pwend...@gmail.com
Date: Saturday, March 29, 2014 1:47 PM
To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov
Cc: Chris Mattmann mattm...@apache.org, dev@spark.apache.org
dev@spark.apache.org
Subject: Re: Could you undo the JIRA dev list e-mails?

Okay cool - sorry about that. Infra should be able to migrate these over
to an issues@ list shortly. I'd rather bother a few moderators than the
entire dev list... but ya I realize it's annoying :/


On Sat, Mar 29, 2014 at 1:22 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:

I reverted this Patrick, per your request:

[hermes] 8:21pm spark.apache.org> ezmlm-list dev | grep jira
j...@apache.org
[hermes] 8:21pm spark.apache.org> ezmlm-unsub dev j...@apache.org
[hermes] 8:21pm spark.apache.org> ezmlm-list dev | grep jira
[hermes] 8:21pm spark.apache.org>

Note that I and the other moderators will now receive moderation emails
until the infra ticket is fixed, but others will not.
I'll set up a mail filter.

Chris


-Original Message-
From: Mattmann, Chris Mattmann chris.a.mattm...@jpl.nasa.gov
Date: Saturday, March 29, 2014 1:11 PM
To: Patrick Wendell pwend...@gmail.com, Chris Mattmann
mattm...@apache.org
Cc: dev@spark.apache.org dev@spark.apache.org
Subject: Re: Could you undo the JIRA dev list e-mails?

Patrick,

No problem -- at the same time realize that I and the other moderators were
getting spammed by moderation emails from JIRA, so you should take that into
consideration as well.

Cheers,
Chris


-Original Message-
From: Patrick Wendell pwend...@gmail.com
Date: Saturday, March 29, 2014 11:59 AM
To: Chris Mattmann mattm...@apache.org
Cc: d...@spark.incubator.apache.org d...@spark.incubator.apache.org
Subject: Re: Could you undo the JIRA dev list e-mails?

Okay I think I managed to revert this by just removing jira@a.o from our
dev list.


On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell
pwend...@gmail.com wrote:

Hey Chris,


I don't think our JIRA has been fully migrated to Apache infra, so it's
really confusing to send people e-mails referring to the new JIRA since
we haven't announced it yet. There is some content there because we've
been trying to do the migration, but
 I'm not sure it's entirely finished.


Also, right now our github comments go to a commits@ list. I'm actually
-1 copying all of these to JIRA because we do a bunch of review level
comments that are going to pollute the JIRA a bunch.


In any case, can you revert the change whatever it was that sent these
to
the dev list? We should have a coordinated plan about this transition
and
the e-mail changes we plan to make.


- Patrick


















Re: Travis CI

2014-03-29 Thread Nan Zhu
Hi,   

Is the migration from Jenkins to Travis finished?

Based on my observations over the past few days, I think Travis is actually
not stable (and Jenkins has become unstable too... :-( ). I'm actively working
on two PRs related to the DAGScheduler, and I saw:

Problems on Travis:

1. The "large number of iterations" test in BagelSuite sometimes fails because
it doesn't produce any output within 10 seconds.

2. hive/test is usually aborted because it doesn't produce any output within
10 minutes.

3. A test case in Streaming.CheckpointSuite failed.

4. hive/test didn't finish in 50 minutes and was aborted.

Problems on Jenkins:

1. The build didn't finish in 90 minutes, and the process was aborted.

2. The same as item 3 in the Travis list.

Some of these problems appeared on Jenkins in past months as well, but not as
often.

I'm not complaining; I know the admins are working hard to keep everything in
the community running well.

I'm just reporting what I saw, and I hope it can help you identify the problems.

Thank you  

--  
Nan Zhu


On Tuesday, March 25, 2014 at 10:11 PM, Patrick Wendell wrote:

 Ya It's been a little bit slow lately because of a high error rate in
 interactions with the git-hub API. Unfortunately we are pretty slammed
 for the release and haven't had a ton of time to do further debugging.
  
 - Patrick
  
  On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
  I just found that the Jenkins is not working from this afternoon
   
  for one PR, the first time build failed after 90 minutes, the second time it
  has run for more than 2 hours, no result is returned
   
  Best,
   
  --
  Nan Zhu
   
   
  On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:
   
  That's not correct - like Michael said the Jenkins build remains the
  reference build for now.
   
   On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
   
  I assume the Jenkins is not working now?
   
  Best,
   
  --
  Nan Zhu
   
   
  On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
   
  Just a quick note to everyone that Patrick and I are playing around with
  Travis CI on the Spark github repository. For now, travis does not run all
  of the test cases, so will only be turned on experimentally. Long term it
  looks like Travis might give better integration with github, so we are
  going to see if it is feasible to get all of our tests running on it.
   
  *Jenkins remains the reference CI and should be consulted before merging
  pull requests, independent of what Travis says.*
   
  If you have any questions or want to help out with the investigation, let
  me know!
   
  Michael  



Re: Travis CI

2014-03-29 Thread Michael Armbrust

 Is the migration from Jenkins to Travis finished?


It is not finished and really at this point it is only something we are
considering, not something that will happen for sure.  We turned it on in
addition to Jenkins so that we could start finding issues exactly like the
ones you described below to determine if Travis is going to be a viable
option.

Basically it seems to me that the Travis environment is a little less
predictable (probably because of virtualization), and this is pointing out
some existing flakiness in the tests.

If there are tests that are regularly flaky, we should probably file JIRAs so
they can be fixed or switched off.  If you have seen a test fail 2-3 times and
then pass with no changes, I'd say go ahead and file an issue for it (others
should feel free to chime in if we want some other process here).

A few more specific comments inline below.


 2. hive/test usually aborted because it doesn't output anything within 10
 minutes


Hmm, this is a little confusing.  Do you have a pointer to this one?  Was
there any other error?


 4. hive/test didn't finish in 50 minutes, and was aborted


Here I think the right thing to do is probably to break the hive tests in two
and run them in parallel.  There is already machinery for doing this; we just
need to flip the options on in the travis.yml to make it happen.  This is only
going to get more critical as we whitelist more hive tests.  We also talked
about checking the PR and skipping the hive tests when there have been no
changes in catalyst/sql/hive.  I'm okay with this plan; we just need to find
someone with time to implement it.


Re: Migration to the new Spark JIRA

2014-03-29 Thread Nan Zhu
That’s great!  

Andy, thank you for all your contributions to the community!

Best,  

--  
Nan Zhu


On Saturday, March 29, 2014 at 11:40 PM, Patrick Wendell wrote:

 Hey All,
  
 We've successfully migrated the Spark JIRA to the Apache infrastructure.
 This turned out to be a huge effort, led by Andy Konwinski, who deserves
 all of our deepest appreciation for managing this complex migration.
  
 Since Apache runs the same JIRA version as Spark's existing JIRA, there is
 no new software to learn. A few things to note though:
  
 - The issue tracker for Spark is now at:
 https://issues.apache.org/jira/browse/SPARK
  
 - You can sign up to receive an e-mail feed of JIRA updates by e-mailing:
 issues-subscr...@spark.apache.org
  
 - DO NOT create issues on the old JIRA. I'll try to disable this so that it
 is read-only.
  
 - You'll need to create an account at the new site if you don't have one
 already.
  
 - We've imported all the old JIRAs. In some cases the import tool can't
 correctly guess the assignee for the JIRA, so we may have to do some manual
 assignment.
  
 - If you feel like you don't have sufficient permissions on the new JIRA,
 please send me an e-mail. I tried to add all of the committers as
 administrators but I may have missed some.
  
 Thanks,
 Patrick
  
  




Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-29 Thread Tathagata Das
Ah yes, that should be and will be updated!

One more update in the docs: on the Spark Streaming home page
(http://spark.incubator.apache.org/streaming/), under Deployment Options,
it is mentioned that *Spark Streaming can read data from HDFS
(http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html),
Flume (http://flume.apache.org/), Kafka (http://kafka.apache.org/), Twitter
(https://dev.twitter.com/) and ZeroMQ (http://zeromq.org/)*.

But from Spark Streaming 0.9.0 onwards it also supports MQTT.

Can you please update this as well after the voting has completed?


On Sat, Mar 29, 2014 at 9:28 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:

 Small fixes to the docs can be done after the voting has completed. This
 should not determine the vote on the release candidate binaries. Please
 vote as +1 if the published artifacts and binaries are good to go.

 TD
 On Mar 29, 2014 5:23 AM, prabeesh k prabsma...@gmail.com wrote:

  https://github.com/apache/spark/blob/master/docs/quick-start.md in line
  127. one spelling mistake found please correct it. (proogram to program)
 
 
 
  On Fri, Mar 28, 2014 at 9:58 PM, Will Benton wi...@redhat.com wrote:
 
   RC3 works with the applications I'm working on now and MLLib
 performance
   is indeed perceptibly improved over 0.9.0 (although I haven't done a
 real
   evaluation).  Also, from the downstream perspective, I've been
tracking
  the
   0.9.1 RCs in Fedora and have no issues to report there either:
  
  http://koji.fedoraproject.org/koji/buildinfo?buildID=507284
  
   [x] +1 Release this package as Apache Spark 0.9.1
   [ ] -1 Do not release this package because ...