Re: Consistency between RDD's and Native File System

Christopher Nguyen Thu, 16 Jan 2014 19:54:46 -0800

Sai, from your question I infer that you think of an RDD as an
in-memory/cached copy of the underlying data source, and so you expect
some synchronization model between the two.


That is not the RDD model. RDDs are first-class, stand-alone
(distributed, immutable) datasets. Once created, an RDD exists on its own
and isn't expected to automatically notice that some underlying
source has changed. (There is the concept of lineage or provenance for
recomputation of RDDs, but that's orthogonal to this interpretation, so I
won't muddy the issue here.)
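
To make that concrete, here is a toy model in plain Python. It is not Spark
code: ToyDataset, write_results and results.txt are names made up for this
illustration. It only mimics a dataset that is materialized once and then
answers queries from its own copy, using your T1..T4 timeline.

```python
import os
import tempfile


class ToyDataset:
    """Hypothetical stand-in for a cached, immutable dataset (not a Spark API)."""

    def __init__(self, path):
        self._path = path
        self._rows = None  # filled on first use, then never refreshed

    def rows(self):
        # Read the source once; later calls reuse the materialized tuple.
        if self._rows is None:
            with open(self._path) as f:
                self._rows = tuple(line.strip() for line in f if line.strip())
        return self._rows

    def count_passed(self):
        return sum(1 for row in self.rows() if row.endswith("Pass"))


def write_results(path, results):
    # Overwrite the source file with one "student - status" line per entry.
    with open(path, "w") as f:
        for student, status in results:
            f.write(f"{student} - {status}\n")


path = os.path.join(tempfile.mkdtemp(), "results.txt")

# T1: the source as first written.
write_results(path, [("x", "Pass"), ("y", "Fail"), ("z", "With-held")])

# T2: build the dataset and query it.
ds1 = ToyDataset(path)
count_t2 = ds1.count_passed()

# T3: the source changes underneath.
write_results(path, [("x", "Pass"), ("y", "Fail"), ("z", "Pass")])

# T4: the old dataset still answers from its own copy;
# a newly created dataset sees the updated source.
count_t4_old = ds1.count_passed()
count_t4_new = ToyDataset(path).count_passed()

print(count_t2, count_t4_old, count_t4_new)
```

In this sketch nothing tells ds1 that the file changed; the only way to see
z as passed is to create a new dataset from the source, which is the
programmer's call.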

If you're looking for a mutable data table model, we will soon be releasing
to open source something called Distributed DataFrame (DDF, based on R's
data.frame) on top of RDDs that allows you to, among other useful things,
load a dataset, perform transformations on it, and save it back, all the
while holding on to a single DDF reference.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <[email protected]> wrote:

> Thanks Patrick, but I think I didn't put my question clearly...
>
> The question is: say that in the native file system or HDFS I have data
> describing students who passed, failed, and for whom results are with-held
> for some reason.
> *Time T1:*
> x - Pass
> y - Fail
> z - With-held.
>
> *Time T2:*
> So I create an RDD1 reflecting this data and run a query to find how many
> candidates have passed.
> RESULT = 1. RDD1 is cached, or it's stored in the file system, depending on
> the availability of space.
>
> *Time T3:*
> In the native file system, the result for z is now out and z is declared
> passed, so HDFS will need to be modified.
> x - Pass
> y - Fail
> z - Pass.
> Say I now (at time T4) get the RDD1 that is in the file system, or the
> cached copy, and run the same query: I get RESULT = 1, but ideally RESULT
> is 2.
>
> So I was asking: is there a way Spark hints that RDD1 is no longer
> consistent with the file system, or is it up to the programmer to recreate
> RDD1 if the block from which the RDD was created was changed at a later
> point in time?
> [T1 < T2 < T3 < T4]
>
> Thanks in advance...
>
>
> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <[email protected]> wrote:
>
>> RDDs are immutable, so there isn't really such a thing as modifying a
>> block in-place inside of an RDD. As a result, this particular
>> consistency issue doesn't come up in Spark.
>>
>> - Patrick
>>
>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <[email protected]>
>> wrote:
>> > Hello, I am a novice to Spark.
>> >
>> > Say that we have created an RDD1 from the native file system/HDFS and done
>> > some transformations and actions, and that resulted in an RDD2. Let's
>> > assume RDD1 and RDD2 are persisted, cached in-memory. If the block from
>> > which RDD1 was created was modified at time T1 and RDD1/RDD2 is accessed
>> > later at T2 > T1, is there a way Spark ensures consistency, or is it up
>> > to the programmer to make it explicit?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Sai Prasanna. AN*
> *II M.Tech (CS), SSSIHL*
>
>
> * Entire water in the ocean can never sink a ship, Unless it gets inside.
> All the pressures of life can never hurt you, Unless you let them in.*
>
