Mark, that's precisely why I brought up lineage: to say I didn't want to muddy the issue there :)
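[Editor's note: a minimal plain-Python sketch of the behavior Mark demonstrates below. The `LazyTextDataset` class is a hypothetical toy stand-in for an RDD of lines, not the Spark API: an unpersisted lazy dataset re-runs its "lineage" (re-reads the source file) on every evaluation, so edits to the file show up, while a persisted copy keeps its snapshot.]

```python
import os
import tempfile

class LazyTextDataset:
    """Toy stand-in for an RDD of lines: collect() re-reads the source
    unless the dataset has been persisted."""
    def __init__(self, path):
        self.path = path
        self._cached = None           # set only after persist()

    def collect(self):
        if self._cached is not None:  # persisted: "lineage" never re-runs
            return self._cached
        with open(self.path) as f:    # unpersisted: re-read the source
            return [line.rstrip("\n") for line in f]

    def persist(self):
        self._cached = self.collect() # snapshot the current file contents
        return self

path = os.path.join(tempfile.mkdtemp(), "silliness.txt")
with open(path, "w") as f:
    f.write("one line\ntwo line\nred line\nblue line\n")

lines = LazyTextDataset(path)                  # unpersisted
snapshot = LazyTextDataset(path).persist()     # persisted snapshot
print(", ".join(lines.collect()))   # one line, two line, red line, blue line

with open(path, "w") as f:                     # edit the file, as in step 3
    f.write("and now\nfor something\ncompletely\ndifferent\n")

print(", ".join(lines.collect()))      # and now, for something, completely, different
print(", ".join(snapshot.collect()))   # still the original four lines
```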
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Thu, Jan 16, 2014 at 9:09 PM, Mark Hamstra <[email protected]> wrote:

> I don't agree entirely, Christopher. Without persisting or checkpointing
> RDDs, re-evaluation of the lineage will pick up source changes. I'm not
> saying that working this way is a good idea (in fact, it's generally not),
> but you can do things like this:
>
> 1) Create file silliness.txt containing:
>
> one line
> two line
> red line
> blue line
>
> 2) Fire up spark-shell and do this:
>
> scala> val lines = sc.textFile("silliness.txt")
> scala> println(lines.collect.mkString(", "))
> ...
> one line, two line, red line, blue line
>
> 3) Edit silliness.txt so that it is now:
>
> and now
> for something
> completely
> different
>
> 4) Continue on with spark-shell:
>
> scala> println(lines.collect.mkString(", "))
> ...
> and now, for something, completely, different
>
>
> On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen <[email protected]> wrote:
>
>> Sai, from your question, I infer that you have an interpretation that
>> RDDs are somehow an in-memory/cached copy of the underlying data
>> source---and so there is some expectation that there is some
>> synchronization model between the two.
>>
>> That would not be what the RDD model is. RDDs are first-class,
>> stand-alone (distributed, immutable) datasets. Once created, an RDD exists
>> on its own and isn't expected to somehow automatically realize that some
>> underlying source has changed. (There is the concept of lineage or
>> provenance for recomputation of RDDs, but that's orthogonal to this
>> interpretation, so I won't muddy the issue here.)
>>
>> If you're looking for a mutable data table model, we will soon be
>> releasing to open source something called Distributed DataFrame (DDF,
>> based on R's data.frame) on top of RDDs that allows you to, among other
>> useful things, load a dataset, perform transformations on it, and save it
>> back, all the while holding on to a single DDF reference.
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <[email protected]> wrote:
>>
>>> Thanks Patrick, but I think I didn't put my question clearly...
>>>
>>> The question is: say in the native file system or HDFS, I have data
>>> describing students who passed, failed, and for whom results are
>>> withheld for some reason.
>>>
>>> *Time T1:*
>>> x - Pass
>>> y - Fail
>>> z - Withheld
>>>
>>> *Time T2:*
>>> I create an RDD1 reflecting this data and run a query to find how many
>>> candidates have passed: RESULT = 1. RDD1 is cached or stored in the file
>>> system, depending on the availability of space.
>>>
>>> *Time T3:*
>>> In the native file system, the results of z are now out, and z is
>>> declared passed, so HDFS will need to be modified:
>>> x - Pass
>>> y - Fail
>>> z - Pass
>>>
>>> Say now I take the RDD1 that is in the file system (or the cached copy)
>>> and run the same query: I get RESULT = 1, but ideally RESULT = 2.
>>>
>>> So I was asking: is there a way Spark hints that RDD1 is no longer
>>> consistent with the file system, or is it up to the programmer to
>>> recreate RDD1 if the block from which the RDD was created was changed at
>>> a later point in time?
>>> [T1 < T2 < T3 < T4]
>>>
>>> Thanks in advance...
>>>
>>>
>>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <[email protected]> wrote:
>>>
>>>> RDDs are immutable, so there isn't really such a thing as modifying a
>>>> block in place inside of an RDD. As a result, this particular
>>>> consistency issue doesn't come up in Spark.
>>>>
>>>> - Patrick
>>>>
>>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <[email protected]> wrote:
>>>>
>>>> > Hello, I am a novice to Spark.
>>>> >
>>>> > Say that we have created an RDD1 from the native file system/HDFS and
>>>> > done some transformations and actions that resulted in an RDD2. Let's
>>>> > assume RDD1 and RDD2 are persisted, cached in memory. If the block
>>>> > from which RDD1 was created is modified at time T1, and RDD1/RDD2 is
>>>> > accessed later at T2 > T1, is there a way Spark ensures consistency,
>>>> > or is it up to the programmer to make it explicit?
>>>> >
>>>> >
>>>> > --
>>>> > View this message in context:
>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>> --
>>> *Sai Prasanna. AN*
>>> *II M.Tech (CS), SSSIHL*
>>>
>>> *Entire water in the ocean can never sink a ship, unless it gets
>>> inside. All the pressures of life can never hurt you, unless you let
>>> them in.*
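[Editor's note: Sai's pass/fail scenario in the same plain-Python toy terms (not Spark code). A result computed and cached at T2 keeps answering RESULT = 1 after the source file changes at T3; nothing invalidates it automatically, and the fix is for the programmer to explicitly rebuild from the source, just as the thread concludes. The file name and helper function are illustrative assumptions.]

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.txt")
with open(path, "w") as f:
    f.write("x - Pass\ny - Fail\nz - Withheld\n")    # state at T1

def count_passes(path):
    """Count students marked Pass by reading the source file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip().endswith("Pass"))

cached_result = count_passes(path)                   # T2: RESULT = 1, then cached

with open(path, "w") as f:
    f.write("x - Pass\ny - Fail\nz - Pass\n")        # T3: source is updated

print(cached_result)        # 1 -- the cached answer is now stale
print(count_passes(path))   # 2 -- re-reading the source is an explicit step
```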
