[ https://issues.apache.org/jira/browse/SPARK-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen reassigned SPARK-6075:
---------------------------------

    Assignee: Josh Rosen

> After SPARK-3885, some tasks' accumulator updates may be lost
> -------------------------------------------------------------
>
>                 Key: SPARK-6075
>                 URL: https://issues.apache.org/jira/browse/SPARK-6075
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Tests
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Blocker
>
> It looks like some of the AccumulatorSuite tests have started failing 
> nondeterministically on Jenkins.  The errors seem to be due to lost / missing 
> accumulator updates, e.g.
> {code}
> Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
> {code}
> This could be related to SPARK-3885 / 
> https://github.com/apache/spark/pull/4021, a patch to garbage-collect 
> accumulators, which was only merged into master. An example failure:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/
> I think I've figured it out: consider the lifecycle of an accumulator in a 
> task, say a ShuffleMapTask: on the executor, each task deserializes its own 
> copy of the RDD inside its runTask method, so the strong reference to the 
> RDD disappears when runTask returns. In Executor.run(), we call 
> Accumulators.values only after runTask has exited, so there's a small 
> window in which the task's RDD can be GC'd. Since SPARK-3885, the registry 
> holds accumulators only through WeakReferences, so once the RDD, the last 
> strong reference to them, is collected, the accumulators can be GC'd as 
> well and their updates are lost.
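> To make the window concrete, here is a minimal, self-contained Scala sketch 
> of the failure mode. The names (WeakRegistry, RaceDemo) are illustrative, 
> not Spark's actual internals:
> {code}
> import java.lang.ref.WeakReference
> import scala.collection.mutable
> 
> // Stand-in for the post-SPARK-3885 registry: it holds values only weakly.
> object WeakRegistry {
>   private val originals = mutable.Map[Long, WeakReference[AnyRef]]()
> 
>   def register(id: Long, acc: AnyRef): Unit =
>     originals(id) = new WeakReference(acc)
> 
>   // None means the value was garbage-collected before we read it.
>   def get(id: Long): Option[AnyRef] =
>     originals.get(id).flatMap(ref => Option(ref.get))
> }
> 
> object RaceDemo extends App {
>   var acc: AnyRef = new Object // the task's only strong reference
>   WeakRegistry.register(1L, acc)
> 
>   acc = null   // runTask returns; the last strong reference is gone
>   System.gc()  // the GC is now free to reclaim the accumulator
> 
>   // Executor.run() reads the registry only after runTask exits; in that
>   // window the value may already be gone, so the update is lost.
>   println(WeakRegistry.get(1L)) // may print None
> }
> {code}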
> The fix is to keep strong references in localAccums (sketched below), since 
> we clear that map at the end of each task anyway. I'm glad that I was able 
> to figure out precisely why this was necessary, and sorry that I missed it 
> during review; I'll submit a fix shortly. As a preventative measure, it 
> might be a good idea to write up the lifetime / lifecycle of objects' 
> strong references whenever we use WeakReferences, since explicitly writing 
> that out would help prevent these sorts of mistakes in the future.
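> Here is a sketch of the idea behind the fix, again with illustrative names 
> rather than Spark's actual implementation: the executor thread holds each 
> task's accumulators strongly until it has read their values, then clears 
> the map.
> {code}
> import scala.collection.mutable
> 
> // Per-task strong references, cleared at the end of each task. While an
> // entry lives here, the matching WeakReference in the global registry
> // cannot be cleared, so no updates are lost.
> object LocalAccums {
>   private val perThread = new ThreadLocal[mutable.Map[Long, AnyRef]] {
>     override def initialValue() = mutable.Map[Long, AnyRef]()
>   }
> 
>   def add(id: Long, acc: AnyRef): Unit = perThread.get()(id) = acc
> 
>   // Safe to call after runTask has exited: the strong references above
>   // keep the accumulators alive until we've read them.
>   def values: Map[Long, AnyRef] = perThread.get().toMap
> 
>   def clear(): Unit = perThread.get().clear() // end of task
> }
> {code}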



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
