[ https://issues.apache.org/jira/browse/SPARK-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen reassigned SPARK-6075:
---------------------------------

    Assignee: Josh Rosen

> After SPARK-3885, some tasks' accumulator updates may be lost
> -------------------------------------------------------------
>
>                 Key: SPARK-6075
>                 URL: https://issues.apache.org/jira/browse/SPARK-6075
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Tests
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Blocker
>
> It looks like some of the AccumulatorSuite tests have started failing
> nondeterministically on Jenkins. The errors seem to be due to lost / missing
> accumulator updates, e.g.
> {code}
> Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
> {code}
> This could somehow be related to SPARK-3885 /
> https://github.com/apache/spark/pull/4021, a patch to garbage-collect
> accumulators, which was only merged into master.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/
> I think I've figured it out: consider the lifecycle of an accumulator in a
> task, say a ShuffleMapTask. On the executor, each task deserializes its own
> copy of the RDD inside its runTask method, so the strong reference to the
> RDD disappears at the end of runTask. In Executor.run(), we call
> Accumulators.values after runTask has exited, so there is a small window in
> which the task's RDD can be GC'd, causing the accumulators to be GC'd as
> well, because there are no longer any strong references to them.
> The fix is to keep strong references in localAccums, since we clear that map
> at the end of each task anyway. I'm glad that I was able to figure out
> precisely why this was necessary, and sorry that I missed this during review;
> I'll submit a fix shortly.
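The hazard described above can be sketched outside Spark. The following is a minimal Python illustration, not Spark's actual code: CPython's deterministic refcounting stands in for the JVM's GC, and the Accumulator class, weak_registry, and local_accums names are hypothetical analogues of Spark's weakly-referencing Accumulators registry and the localAccums map.

```python
# Sketch of the weak-reference GC hazard (assumed names, not Spark's API).
import weakref


class Accumulator:
    """Stand-in for a Spark accumulator holding a task's partial result."""
    def __init__(self, value):
        self.value = value


# Registry holding only weak references, like the post-SPARK-3885 design.
weak_registry = {}

# Analogue of localAccums: the fix keeps a strong reference here until the
# executor has read the values, then clears it at the end of the task.
local_accums = {}


def run_task_buggy():
    # The task deserializes its own copy of the RDD, which owns the
    # accumulator; the only strong reference lives inside this function.
    acc = Accumulator(901)
    weak_registry["task-0"] = weakref.ref(acc)
    # On return, `acc` goes out of scope: the accumulator is now only
    # weakly reachable and can be collected before anyone reads it.


def run_task_fixed():
    acc = Accumulator(901)
    weak_registry["task-1"] = weakref.ref(acc)
    local_accums["task-1"] = acc  # strong ref survives the task's return


run_task_buggy()
# Analogue of Executor.run() reading Accumulators.values after runTask:
# in CPython the referent is already gone, so the update is silently lost.
lost = weak_registry["task-0"]()          # None: update was lost

run_task_fixed()
kept = weak_registry["task-1"]().value    # 901: still strongly reachable
local_accums.clear()                      # cleared at end of task
```

The point of the fix is that the strong reference's lifetime is tied to an explicit event (clearing localAccums after the values are read) rather than to the accident of when the task's stack frame disappears.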
> In terms of preventative measures, it might be a good idea to write up the
> lifetime / lifecycle of objects' strong references whenever we're using
> WeakReferences, since the process of explicitly writing that out would
> prevent these sorts of mistakes in the future.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)