Sung Hwan, yes, your interpretation is exactly what I meant: if you tried
it, it would (mostly) work, though I'm uncertain what guarantees you would
get on the semantics. There would definitely be no fault tolerance if the
mutations depend on state that is not captured in the RDD lineage.
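
To make the lineage point concrete, here is a minimal sketch in plain
Python. It is not Spark code: the lists and dicts only stand in for RDDs,
and the names pure_step, impure_step, replay, hidden, and the sample rows
are all hypothetical, made up for illustration. It assumes "recovery" means
recomputing the tags from the original rows, which is roughly what lineage
gives you.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]   # (row_id, value): stands in for the large, immutable data
Tags = Dict[int, int]     # row_id -> small per-row tag that changes each iteration


def pure_step(rows: List[Row], tags: Tags, iteration: int) -> Tags:
    """Build a NEW tag table from the inputs only; nothing is mutated in place."""
    return {rid: tags[rid] + (1 if value > iteration else 0) for rid, value in rows}


# Hidden mutable state that lives outside the "lineage" (hypothetical).
hidden = {"calls": 0}


def impure_step(rows: List[Row], tags: Tags, iteration: int) -> Tags:
    """Same shape as pure_step, but the result depends on the hidden counter."""
    hidden["calls"] += 1
    return {rid: tags[rid] + hidden["calls"] for rid, _ in rows}


def replay(rows: List[Row], step: Callable[[List[Row], Tags, int], Tags],
           iterations: int) -> Tags:
    """Recompute the tags from scratch, the way lineage-based recovery would."""
    tags = {rid: 0 for rid, _ in rows}
    for i in range(iterations):
        tags = step(rows, tags, i)
    return tags


rows = [(0, 0.5), (1, 2.5), (2, 5.0)]

# Pure steps: a recomputation reproduces exactly the same tags.
pure_first = replay(rows, pure_step, 3)
pure_again = replay(rows, pure_step, 3)

# Impure steps: the "recovered" tags silently differ from the originals.
impure_first = replay(rows, impure_step, 2)
impure_again = replay(rows, impure_step, 2)
```

In Spark terms the pure version corresponds to deriving a new RDD of tags
at each iteration (e.g., via map or join) instead of mutating rows in
place, while the impure version is the case where recomputing a lost
partition may not give back what you had.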

DDF is to RDD as RDD is to HDFS. Not a perfect analogy, but the point is
that it's a higher-level abstraction, with all the attendant implications,
pluses and minuses. With DDFs you get to think of everything as tables with
schemas, while the underlying complexity of mutability and data
representation is hidden away. You also get rich idioms to operate on those
tables like filtering, projection, subsetting, handling of missing data
(NA's), dummy-column generation, data mining statistics and machine
learning, etc. In some respects it replaces a lot of boilerplate analytics
that you don't want to re-invent over and over again, e.g., FiveNum or
XTabs. So instead of 100 lines of code, it's 4. In other respects it lets
you easily apply "arbitrary" machine learning algorithms without having
to think too hard about getting the data types just right. Etc.

But you would also find yourself wanting access to the underlying RDDs for
their full semantics & flexibility.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Fri, Mar 28, 2014 at 8:46 PM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:

> Thanks Chris,
>
> I'm not exactly sure what you mean by MutablePair, but are you saying
> that we could create RDD[MutablePair] and modify individual rows?
>
> If so, will that play nicely with RDD's lineage and fault tolerance?
>
> As for the alternatives, I don't think 1 is something we want to do, since
> that would require another complex system we'll have to implement. Is DDF
> going to be an alternative to RDD?
>
> Thanks again!
>
>
>
> On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen <c...@adatao.com> wrote:
>
>> Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
>> get what you want is to transform to another RDD. But you might look at
>> MutablePair (
>> https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
>> to see if the semantics meet your needs.
>>
>> Alternatively you can consider:
>>
>>    1. Build & provide a fast lookup service that stores and returns the
>>    mutable information keyed by the RDD row IDs, or
>>    2. Use DDF (Distributed DataFrame) which we'll make available in the
>>    near future, which will give you fully mutable-row table semantics.
>>
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>>
>> On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> Hey guys,
>>>
>>> I need to tag individual RDD lines with some values. This tag value
>>> would change at every iteration. Is this possible with RDDs (I suppose this
>>> is sort of like a mutable RDD, but it's more)?
>>>
>>> If not, what would be the best way to do something like this? Basically,
>>> we need to keep mutable information per data row (this would be something
>>> much smaller than the actual data row, however).
>>>
>>> Thanks
>>>
>>
>>
>
