Re: Does foreach operation increase rdd lineage?

2014-01-25 Thread Mark Hamstra
Or just checkpoint() it.

On Sat, Jan 25, 2014 at 2:40 PM, Jason Lenderman wrote:
> RDDs are supposed to be immutable. Changing values using foreach seems like a bad thing to do, and is going to mess up the probability in some very difficult to understand fashion if you wind up losing a parti
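
A minimal sketch of the checkpoint() suggestion, assuming an iterative job that derives a new RDD on every step. The checkpoint directory, the `_ + 1.0` update, the local master, and the every-fifth-iteration interval are placeholders chosen for illustration and do not come from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical directory; on a cluster this should be a reliable store such as HDFS.
      sc.setCheckpointDir("/tmp/spark-checkpoints")

      var state: RDD[Double] = sc.parallelize(Seq.fill(1000)(0.0), 4)
      for (i <- 1 to 20) {
        // Each iteration derives a new RDD, so the lineage grows by one step.
        val next = state.map(_ + 1.0).cache()
        // checkpoint() has to be requested before the first action on the RDD.
        if (i % 5 == 0) next.checkpoint()
        next.count()        // action: materializes `next` and writes the checkpoint if requested
        state.unpersist()   // release the previous iteration's cached blocks
        state = next
      }
      // Inspect the lineage that remains after the last checkpoint.
      println(state.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Caching before checkpointing avoids computing the RDD a second time when the checkpoint is written.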

Re: Does foreach operation increase rdd lineage?

2014-01-25 Thread Jason Lenderman
RDDs are supposed to be immutable. Changing values using foreach seems like a bad thing to do, and is going to mess up the probability in some very difficult-to-understand fashion if you wind up losing a partition of your state that needs to be regenerated. Each update of the state of your Markov

Re: Does foreach operation increase rdd lineage?

2014-01-24 Thread 尹绪森
foreach is an action; from the source code you can see that it calls the runJob method. In Spark, it is difficult to change data in place, because it has functional semantics. I think "mapPartitions" is more suitable for machine learning algorithms. I am writing an LDA for MLlib; you can have a look if you
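
A rough sketch of the mapPartitions approach, assuming the sampler state fits in an RDD[Int]: every sweep derives a new RDD instead of mutating elements inside foreach. It uses mapPartitionsWithIndex, a variant of mapPartitions, so that the random seed can be derived from the sweep number and the partition index; a lost partition that is recomputed then reproduces the same draws. The `resample` function, the seed formula, and the local master are hypothetical and are not taken from this thread or from MLlib.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Random

object MapPartitionsSweep {
  // Hypothetical per-element update; a real Gibbs step would condition on the model state.
  def resample(x: Int, rng: Random): Int = if (rng.nextBoolean()) x + 1 else x

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sweep-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      var state: RDD[Int] = sc.parallelize(1 to 1000, 4).cache()
      for (sweep <- 1 to 3) {
        val next = state.mapPartitionsWithIndex({ (partitionIndex, iter) =>
          // One generator per partition, seeded deterministically (assumed scheme).
          val rng = new Random(sweep * 100003L + partitionIndex)
          iter.map(x => resample(x, rng))
        }, preservesPartitioning = true).cache()
        next.count()        // action: materializes the new state
        state.unpersist()   // drop the previous sweep's cached copy
        state = next
      }
      println(state.count())
    } finally {
      sc.stop()
    }
  }
}
```

Each sweep still adds one step to the lineage, so long-running chains may want to combine this with periodic checkpointing.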

Re: Does foreach operation increase rdd lineage?

2014-01-24 Thread guojc
Yes, I mean Gibbs sampling. From the API documentation, I don't see why the data would be collected to the driver. The documentation says: 'def foreach(f: (T) => Unit): Unit Applies a function f to all elements of this RDD.' So if I want to change my data in place, what operation should I use? Best Regards

Re: Does foreach operation increase rdd lineage?

2014-01-24 Thread 尹绪森
Do you mean "Gibbs sampling"? Actually, foreach is an action; it will collect all data from the workers to the driver, and the JVM will complain with an OOM error. I am not very sure about your implementation, but if the data does not need to be joined together, you'd better keep it on the workers.

2014/1/24 guojc
> Hi,
> I'

Does foreach operation increase rdd lineage?

2014-01-24 Thread guojc
Hi, I'm writing a parallel MCMC program that holds a very large dataset in memory, and I need to update the dataset in memory without creating an additional copy. Should I use a foreach operation on the RDD to express the change, or do I have to create a new RDD after each sampling step? Thanks, J