Sung Hwan, yes, I'm saying exactly what you interpreted: if you tried it, it would (mostly) work, though I'm uncertain what guarantees you would get on the semantics. There would definitely be no fault tolerance if the mutations depend on state that is not captured in the RDD lineage.
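To make the lineage point concrete, here is a toy sketch in plain Python. It is not Spark: `ToyRDD`, `lose_cache`, and the [value, tag] row layout are made up for illustration, and real RDD recovery works per partition rather than all at once. It only tries to show why a tag written into already-materialized rows disappears on recompute, while a tag produced by a transformation is rebuilt from the lineage.

```python
# Toy model only, NOT the Spark API. "ToyRDD" is a hypothetical stand-in that
# keeps a stable source plus a chain of pure functions (its "lineage").


class ToyRDD:
    def __init__(self, source, lineage=()):
        self._source = list(source)     # stable input, e.g. a file on HDFS
        self._lineage = tuple(lineage)  # row-wise functions, applied in order
        self._cache = None              # materialized rows; may be lost

    def map(self, fn):
        # A transformation never touches existing rows; it extends the lineage.
        return ToyRDD(self._source, self._lineage + (fn,))

    def _compute(self):
        rows = list(self._source)
        for fn in self._lineage:
            rows = [fn(row) for row in rows]
        return rows

    def collect(self):
        # Materialize on first use, then serve from the cache.
        if self._cache is None:
            self._cache = self._compute()
        return self._cache

    def lose_cache(self):
        # Simulates a failure that drops the materialized rows.
        self._cache = None


# Each row is a mutable [value, tag] pair; every tag starts at 0.
base = ToyRDD([1, 2, 3]).map(lambda v: [v, 0])

# Approach A: mutate the materialized rows in place.
# The lineage knows nothing about this write.
for row in base.collect():
    row[1] = row[0] * 10
in_place_before = [list(r) for r in base.collect()]
base.lose_cache()
in_place_after = [list(r) for r in base.collect()]

# Approach B: derive a new dataset per iteration, so each tag update
# is itself a step in the lineage.
tagged = base
for _ in range(3):
    tagged = tagged.map(lambda row: [row[0], row[1] + row[0]])
derived_before = tagged.collect()
tagged.lose_cache()
derived_after = tagged.collect()

print("in place:", in_place_before, "->", in_place_after)
print("derived: ", derived_before, "->", derived_after)
```

In this toy run the in-place tags (10, 20, 30) come back as 0 after the simulated loss, while the derived tags (3, 6, 9) are identical before and after, because they are recomputed from the source through the recorded functions.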
DDF is to RDD as RDD is to HDFS. Not a perfect analogy, but the point is that it's an abstraction above, with all the attendant implications, pluses and minuses. With DDFs you get to think of everything as tables with schemas, while the underlying complexity of mutability and data representation is hidden away. You also get rich idioms to operate on those tables, like filtering, projection, subsetting, handling of missing data (NAs), dummy-column generation, data mining statistics and machine learning, etc. In some respects it replaces a lot of boilerplate analytics that you don't want to re-invent over and over again, e.g., FiveNum or XTabs. So instead of 100 lines of code, it's 4. In other respects it allows you to easily apply "arbitrary" machine learning algorithms without having to think too hard about getting the data types just right, and so on. But you would also find yourself wanting access to the underlying RDDs for their full semantics & flexibility.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Fri, Mar 28, 2014 at 8:46 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

> Thanks Chris,
>
> I'm not exactly sure what you mean with MutablePair, but are you saying
> that we could create RDD[MutablePair] and modify individual rows?
>
> If so, will that play nicely with RDD's lineage and fault tolerance?
>
> As for the alternatives, I don't think 1 is something we want to do, since
> that would require another complex system we'll have to implement. Is DDF
> going to be an alternative to RDD?
>
> Thanks again!
>
> On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen <c...@adatao.com> wrote:
>
>> Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
>> get what you want is to transform to another RDD. But you might look at
>> MutablePair (
>> https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
>> to see if the semantics meet your needs.
>>
>> Alternatively you can consider:
>>
>> 1. Build & provide a fast lookup service that stores and returns the
>> mutable information keyed by the RDD row IDs, or
>> 2. Use DDF (Distributed DataFrame) which we'll make available in the
>> near future, which will give you fully mutable-row table semantics.
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>> On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> Hey guys,
>>>
>>> I need to tag individual RDD lines with some values. This tag value
>>> would change at every iteration. Is this possible with RDD (I suppose this
>>> is sort of like mutable RDD, but it's more)?
>>>
>>> If not, what would be the best way to do something like this? Basically,
>>> we need to keep mutable information per data row (this would be something
>>> much smaller than actual data row, however).
>>>
>>> Thanks