Or just checkpoint() it.
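A minimal sketch of what that could look like (the checkpoint directory, the
update step, and the every-10-iterations cadence are my own illustrative
assumptions, not something from this thread):

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]"))
    // Checkpoint data should go to reliable storage (HDFS on a real cluster).
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    var state = sc.parallelize(1 to 1000000).map(i => (i, 0.0))
    for (iter <- 1 to 100) {
      // One sampling sweep, expressed functionally: a new RDD each iteration.
      state = state.map { case (i, v) => (i, v + scala.util.Random.nextGaussian()) }
      if (iter % 10 == 0) {
        state.checkpoint() // cut the lineage so a lost partition is reloaded, not recomputed
        state.count()      // checkpointing happens when the RDD is next materialized
      }
    }
    sc.stop()
  }
}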
On Sat, Jan 25, 2014 at 2:40 PM, Jason Lenderman wrote:
> RDDs are supposed to be immutable. Changing values using foreach seems
> like a bad thing to do, and is going to mess up the probability in some
> very difficult-to-understand fashion if you wind up losing a partition of
> your state that needs to be regenerated.
RDDs are supposed to be immutable. Changing values using foreach seems like
a bad thing to do, and is going to mess up the probability in some very
difficult-to-understand fashion if you wind up losing a partition of your
state that needs to be regenerated.
Each update of the state of your Markov
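A hedged sketch of the failure mode described above (the CellState class and
the numbers are hypothetical, not from this thread): mutating cached objects
inside foreach appears to work, but a lost or evicted partition is rebuilt
from its lineage, so the in-place updates for that partition silently revert.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

case class CellState(var value: Double) // mutable cell: the anti-pattern

def riskyInPlaceUpdate(sc: SparkContext): Unit = {
  val states = sc.parallelize(0 until 1000)
    .map(_ => CellState(0.0))
    .persist(StorageLevel.MEMORY_ONLY) // cached, but not fault tolerant by itself

  // The mutation only touches the cached copies on the workers.
  states.foreach(s => s.value += 1.0)

  // If an executor dies or a partition is evicted, that partition is recomputed
  // from parallelize(...).map(...), i.e. back to CellState(0.0), leaving the
  // sampler with an inconsistent mix of updated and reset state.
}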
foreach is an action; from the source code you can see that it calls the
runJob method. In Spark it is difficult to change data in place, because it
has functional semantics.
I think "mapPartitions" is more suitable for machine learning algorithms. I
am writing an LDA implementation for MLlib; you can have a look if you are
interested.
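For what that can look like, here is a minimal, hedged sketch of a
per-partition Gibbs-style sweep (resample and numTopics are placeholders of
mine; a real LDA sampler would also carry word/topic count tables):

import org.apache.spark.rdd.RDD
import scala.util.Random

object GibbsSweepSketch {
  // Placeholder per-element update; a real sampler would use the model's counts.
  def resample(z: Int, numTopics: Int, rng: Random): Int = rng.nextInt(numTopics)

  // mapPartitions gives one pass over a whole partition with local state
  // (here, a single RNG per partition) and still returns a new, immutable RDD.
  def gibbsSweep(assignments: RDD[Int], numTopics: Int): RDD[Int] =
    assignments.mapPartitions { iter =>
      val rng = new Random()
      iter.map(z => resample(z, numTopics, rng))
    }
}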
Yes, I mean Gibbs sampling. From the API documentation, I don't see why the
data would be collected to the driver. The documentation says:
def foreach(f: (T) => Unit): Unit
Applies a function f to all elements of this RDD.
So if I want to change my data in place, which operation should I use?
Best Regards
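One hedged sketch of the usual pattern (the sample update is a placeholder,
not from this thread): express each sampling step as a transformation that
produces a new RDD, cache the new one, and unpersist the old one so only one
full copy stays in memory at a time.

import org.apache.spark.rdd.RDD

object SamplingStepSketch {
  // Placeholder per-element update.
  def sample(x: Double): Double = x + scala.util.Random.nextGaussian()

  def step(current: RDD[Double]): RDD[Double] = {
    val next = current.map(sample).cache() // new, immutable state for the next iteration
    next.count()                           // materialize before dropping the old copy
    current.unpersist()                    // free the previous iteration's cached blocks
    next
  }
}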
Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
collect all the data from the workers to the driver, and you will get an OOM
error from the JVM. I am not very sure about your implementation, but if the
data does not need to be joined together, you'd better keep it on the
workers.
2014/1/24 guojc
Hi,
I'm writing a parallel MCMC program that keeps a very large dataset in
memory, and I need to update the dataset in place while avoiding an
additional copy. Should I use a foreach operation on the RDD to express the
change, or do I have to create a new RDD after each sampling step?
Thanks,
J