Re: A couple questions about shared variables

2014-09-24 Thread Nan Zhu
I proposed a fix in https://github.com/apache/spark/pull/2524 and would be glad to receive feedback. -- Nan Zhu On Tuesday, September 23, 2014 at 9:06 PM, Sandy Ryza wrote: Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these nuances. -Sandy On Mon, Sep 22, 2014 at

Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
If you think it is necessary to fix, I would like to resubmit that PR (it seems to have some conflicts with the current DAGScheduler). My suggestion is to make it an option in the accumulator; e.g., some algorithms that use an accumulator for result calculation need a deterministic accumulator,

Re: A couple questions about shared variables

2014-09-22 Thread Sandy Ryza
MapReduce counters do not count duplications. In MapReduce, if a task needs to be re-run, the value of the counter from the second task overwrites the value from the first task. -Sandy On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu zhunanmcg...@gmail.com wrote: If you think it as necessary to fix,
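
A minimal sketch of the overwrite behaviour Sandy describes, assuming counters are tracked per task on the collecting side. The class and method names here are hypothetical illustrations, not Hadoop's API: the latest report for a task replaces any earlier one, so a re-run task is never counted twice.

```python
class OverwritingCounters:
    """Hypothetical per-task counter store (illustration only, not Hadoop's API)."""

    def __init__(self):
        # Latest reported counter value for each task ID.
        self._by_task = {}

    def report(self, task_id, value):
        # A later attempt of the same task replaces the earlier value
        # instead of being added on top of it.
        self._by_task[task_id] = value

    def total(self):
        # The job-level counter is the sum of one value per task.
        return sum(self._by_task.values())


counters = OverwritingCounters()
counters.report("task-1", 10)
counters.report("task-2", 4)
counters.report("task-1", 10)  # task-1 re-run: overwrites, does not add
job_total = counters.total()   # 14, not 24
```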

Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
I see, thanks for pointing this out -- Nan Zhu On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote: MapReduce counters do not count duplications. In MapReduce, if a task needs to be re-run, the value of the counter from the second task overwrites the value from the first

Re: A couple questions about shared variables

2014-09-21 Thread Matei Zaharia
Hmm, good point, this seems to have been broken by refactorings of the scheduler, but it worked in the past. Basically the solution is simple -- in a result stage, we should not apply the update for each task ID more than once -- the same way we don't call job.listener.taskSucceeded more than
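
A rough sketch of the idea in Matei's message, assuming the driver remembers which task IDs in a result stage have already had their accumulator update applied. This is not Spark's DAGScheduler code; `ResultStageAccumulator` and `task_succeeded` are hypothetical names used only for illustration.

```python
class ResultStageAccumulator:
    """Hypothetical driver-side bookkeeping (illustration only, not Spark's API)."""

    def __init__(self):
        self.value = 0
        # Task IDs whose update has already been merged into `value`.
        self._applied = set()

    def task_succeeded(self, task_id, update):
        # Ignore a second completion of the same task ID, such as a
        # re-run or a duplicate attempt, so its update is applied once.
        if task_id in self._applied:
            return False
        self._applied.add(task_id)
        self.value += update
        return True


acc = ResultStageAccumulator()
first = acc.task_succeeded(0, 5)      # applied
second = acc.task_succeeded(1, 7)     # applied
duplicate = acc.task_succeeded(0, 5)  # same task ID again: skipped
final_value = acc.value               # 12
```

Tracking by task ID rather than by attempt is what makes the result independent of how many times a task happens to finish.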

Re: A couple questions about shared variables

2014-09-20 Thread Matei Zaharia
Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that tasks data