Bao, to help clarify what TD is saying: Spark runs multiple worker threads in parallel, each executing the same closure code in the same JVM on the same machine, but operating on different rows of data.
Because of this parallelism, if that worker code weren't thread-safe for some reason, you'd have a problem.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Mon, Dec 30, 2013 at 4:27 PM, Tathagata Das <[email protected]> wrote:

> No, I did not mean that. What I meant was something simpler. Let's say
> the ScriptEngine maintains some internal state and the function
> ScriptEngine.eval(...) is not thread-safe. That is, calling
> ScriptEngine.eval() simultaneously from multiple threads would cause race
> conditions in that internal state, and eval() would give incorrect
> answers. That would be a problem if you used ScriptEngine in a map
> function, because multiple threads in a worker JVM may be running the map
> function simultaneously. This is something to be aware of when using
> static, stateful objects within Spark.
>
> TD
>
> On Sun, Dec 29, 2013 at 7:32 PM, Bao <[email protected]> wrote:
>
>> Thanks guys, that's interesting. Although the singleton object is
>> defined at the driver, Spark actually serializes the closure and sends
>> it to the workers. The interesting thing is that ScriptEngine is NOT
>> serializable, yet as long as it hasn't been initialized, Spark can
>> serialize the closure just fine. If I force it to initialize first,
>> Spark throws a NotSerializableException.
>>
>> Anyway, following Christopher's suggestion and avoiding references to
>> anything outside the closure is better.
>>
>> TD, do you mean that Executors share the same SerializerInstance, and
>> that more than one thread may call the same closure instance?
>>
>> -Bao.
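
A minimal sketch of the pitfall TD describes and the per-partition
workaround Christopher suggests. It assumes an RDD[String] named `lines`
and an available JavaScript ScriptEngine; those names and the structure
are hypothetical, not code from the thread:

    // Sketch only: `lines` is an assumed RDD[String]; the engine name and
    // usage are illustrative, not taken from the original thread.
    import javax.script.{ScriptEngine, ScriptEngineManager}

    object Engines {
      // One engine per worker JVM, lazily initialized on first use and
      // shared by every task thread on that worker. If eval() keeps
      // internal state, concurrent map tasks will race on it.
      lazy val shared: ScriptEngine =
        new ScriptEngineManager().getEngineByName("JavaScript")
    }

    // Racy version: all task threads in the JVM hit the same engine.
    // val results = lines.map(line => Engines.shared.eval(line))

    // Safer version: build one engine per partition inside the closure,
    // so each task thread gets its own instance and nothing
    // non-serializable is captured by the closure.
    val results = lines.mapPartitions { iter =>
      val engine = new ScriptEngineManager().getEngineByName("JavaScript")
      iter.map(line => engine.eval(line))
    }

Constructing the engine inside mapPartitions sidesteps both problems in
the thread: each task thread gets its own instance (no race in eval()'s
internal state), and the engine is never captured by the serialized
closure (no NotSerializableException).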
