Yes, I mean Gibbs sampling. From the API documentation, I don't see why the data would be collected to the driver. The documentation says: 'def foreach(f: (T) => Unit): Unit Applies a function f to all elements of this RDD.'
So if I want to change my data in place, which operation should I use?

Best Regards,
Jiacheng Guo

On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <[email protected]> wrote:

> Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
> collect all data from the workers to the driver, and the JVM will
> complain with an OOM.
>
> I am not very sure about your implementation, but if the data does not
> need to be joined together, you'd better keep it on the workers.
>
>
> 2014/1/24 guojc <[email protected]>
>
>> Hi,
>> I'm writing a parallel MCMC program that has a very large dataset
>> in memory, and I need to update the dataset in memory and avoid
>> creating an additional copy. Should I use a foreach operation on the
>> RDD to express the change, or do I have to create a new RDD after each
>> sampling process?
>>
>> Thanks,
>> Jiacheng Guo
>>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin 尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/
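
For readers following the thread, here is a minimal sketch of one way to express a repeated sampling sweep, assuming the state fits in executor memory. As far as I understand the API, RDDs are immutable and foreach returns Unit, so it is meant for side effects; changes made to elements inside it are not guaranteed to be kept. The sketch instead derives a new cached RDD with map on each sweep and unpersists the previous one. The object name GibbsSweepSketch, the resample function, the toy data, and the local[2] master are all hypothetical placeholders, not anything from the original program.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GibbsSweepSketch {
  // Hypothetical stand-in for the per-element update; a real sampler
  // would draw from the appropriate conditional distribution here.
  def resample(x: Double): Double = x * 0.5 + 1.0

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for illustration only.
    val conf = new SparkConf().setAppName("gibbs-sweep-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Toy state, cached so that it stays in executor memory.
    var state: RDD[Double] = sc.parallelize(Seq(0.0, 2.0, 4.0)).cache()

    // foreach runs the function on the executors for its side effects
    // and returns Unit; it does not produce an updated RDD.
    state.foreach(x => require(!x.isNaN))

    for (_ <- 1 to 3) {
      // Derive the next state as a new RDD instead of mutating the old one.
      val next = state.map(resample).cache()
      next.count()        // materialize the new state before dropping the old one
      state.unpersist()   // release the cached blocks of the previous sweep
      state = next
    }

    println(state.collect().mkString(", "))
    sc.stop()
  }
}
```

This keeps at most two copies of the state cached at a time, rather than one per sweep. Whether that is acceptable depends on how close the dataset is to the available memory.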
