Or just checkpoint() it.
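For reference, a minimal sketch of the checkpoint() approach. The checkpoint directory, the state type, and the per-iteration update function are placeholders, not from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mcmc-checkpoint"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical local/HDFS path

    var state: RDD[Double] = sc.parallelize(Seq.fill(1000)(0.0))
    for (i <- 1 to 1000) {
      state = state.map(x => x * 0.99 + 0.01) // hypothetical update step
      if (i % 100 == 0) {
        state.checkpoint() // write to the checkpoint dir and truncate the lineage
        state.count()      // checkpointing is lazy; an action forces it to run
      }
    }
    sc.stop()
  }
}
```

Checkpointing every iteration would be wasteful; doing it every N iterations keeps the lineage short enough to avoid the stack overflow Jason mentions below.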
On Sat, Jan 25, 2014 at 2:40 PM, Jason Lenderman <[email protected]> wrote:

> RDDs are supposed to be immutable. Changing values using foreach seems
> like a bad thing to do, and it is going to mess up the probabilities in
> some very difficult to understand fashion if you wind up losing a
> partition of your state that needs to be regenerated.
>
> Each update of the state of your Markov chain should be a new RDD. I've
> found that I can do this for 100 or 200 iterations and then I'll get a
> stack overflow (presumably because the lineage is growing too large). To
> get around this you either need to occasionally collect the RDD or write
> it to disk.
>
> On Fri, Jan 24, 2014 at 5:22 AM, 尹绪森 <[email protected]> wrote:
>
>> foreach is an action; from the source code you can see that it calls the
>> runJob method. In Spark it is difficult to change data in place, because
>> Spark has functional semantics.
>>
>> I think mapPartitions is more suitable for machine learning algorithms.
>> I am writing an LDA for MLlib; you can have a look if you like, but it is
>> not very deeply optimized yet. I will do more work to optimize it.
>>
>> https://github.com/yinxusen/incubator-spark/blob/lda-mahout/mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala
>>
>> 2014/1/24 guojc <[email protected]>
>>
>>> Yes, I mean Gibbs sampling. From the API documentation, I don't see why
>>> the data would be collected to the driver. The documentation says:
>>> 'def foreach(f: (T) => Unit): Unit
>>> Applies a function f to all elements of this RDD.'
>>>
>>> So if I want to change my data in place, what operation should I use?
>>>
>>> Best Regards,
>>> Jiacheng Guo
>>>
>>> On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <[email protected]> wrote:
>>>
>>>> Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
>>>> collect all data from the workers to the driver, and you will get an
>>>> OOM error from the JVM.
>>>>
>>>> I am not very sure of your implementation, but if the data does not
>>>> need to be joined together, you'd better keep it on the workers.
>>>>
>>>> 2014/1/24 guojc <[email protected]>
>>>>
>>>>> Hi,
>>>>>    I'm writing a parallel MCMC program that keeps a very large
>>>>> dataset in memory, and I need to update the dataset in memory while
>>>>> avoiding an additional copy. Should I use a foreach operation on the
>>>>> RDD to express the change, or do I have to create a new RDD after
>>>>> each sampling pass?
>>>>>
>>>>> Thanks,
>>>>> Jiacheng Guo
>>>>
>>>> --
>>>> Best Regards
>>>> -----------------------------------
>>>> Xusen Yin 尹绪森
>>>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>>>> Multimedia
>>>> Beijing University of Posts & Telecommunications
>>>> Intel Labs China
>>>> Homepage: http://yinxusen.github.io/
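A sketch of the workaround Jason describes, predating checkpoint(): every so often, write the chain state to disk and read it back, so the new RDD's lineage starts at the file. The path, state type, and update function are placeholders, not from the thread:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def runChain(sc: SparkContext, init: RDD[Double], iters: Int): RDD[Double] = {
  var chain = init
  for (i <- 1 to iters) {
    chain = chain.map(x => x * 0.99 + 0.01) // hypothetical update step
    if (i % 100 == 0) {
      val path = s"/tmp/mcmc-state-$i"      // hypothetical scratch directory
      chain.saveAsObjectFile(path)          // materialize the current state
      chain = sc.objectFile[Double](path)   // reload: lineage now starts here
    }
  }
  chain
}
```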

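And a sketch of the mapPartitions-style update Xusen recommends: keep the state on the workers and resample each partition locally, with one RNG per partition so runs are reproducible. The Gaussian "update" is a stand-in for a real Gibbs resampling step, not code from the linked GibbsSampling.scala:

```scala
import org.apache.spark.rdd.RDD

def samplingStep(state: RDD[Double], seed: Long): RDD[Double] =
  state.mapPartitionsWithIndex { (pid, it) =>
    val rng = new java.util.Random(seed + pid) // independent stream per partition
    it.map(x => x + rng.nextGaussian())        // hypothetical resampling step
  }
```

Because the function runs on the workers and returns a new RDD, nothing is shipped to the driver, unlike a collect-based update.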