Sai, to be sure, what Mark said regarding lineage and recomputation is
exactly correct, so if it matters in your use case, you shouldn't ignore
this behavior, even though it is only a side effect.

It just isn't what I think you were expecting in terms of RDD guarantees,
e.g., that a signal is somehow sent to your driver or workers when the
"original source" has changed. Further, there is no guarantee that Spark
hasn't decided to checkpoint the lineage somewhere and is no longer going
back to the "original source" to pick up the latest data. The recomputation
(read "journaling") design goal is reliability, not "data refresh".

Hope that is clear. I do sympathize with a possible reading of your design
goal; we are working on what may be a similar design goal, where streaming
data deltas are automatically reflected into a data structure to which the
user holds a single reference (*).

(*) Yep, this is based on DStream/TD's work and will be available soon.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Jan 16, 2014 at 9:33 PM, Christopher Nguyen <[email protected]> wrote:

> Mark, that's precisely why I brought up lineage, in order to say I didn't
> want to muddy the issue there :)
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
>
> On Thu, Jan 16, 2014 at 9:09 PM, Mark Hamstra <[email protected]> wrote:
>
>> I don't agree entirely, Christopher.  Without persisting or checkpointing
>> RDDs, re-evaluation of the lineage will pick up source changes.  I'm not
>> saying that working this way is a good idea (in fact, it's generally not),
>> but you can do things like this:
>>
>> 1) Create file silliness.txt containing:
>>
>> one line
>> two line
>> red line
>> blue line
>>
>> 2) Fire up spark-shell and do this:
>>
>> scala> val lines = sc.textFile("silliness.txt")
>> scala> println(lines.collect.mkString(", "))
>> .
>> .
>> .
>> one line, two line, red line, blue line
>>
>> 3) Edit silliness.txt so that it is now:
>>
>> and now
>> for something
>> completely
>> different
>>
>> 4) Continue on with spark-shell:
>>
>> scala> println(lines.collect.mkString(", "))
>> .
>> .
>> .
>> and now, for something, completely, different
>>
>>
>> On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen <[email protected]> wrote:
>>
>>> Sai, from your question, I infer that you have an interpretation that
>>> RDDs are somehow an in-memory/cached copy of the underlying data
>>> source, and so there is an expectation of some synchronization model
>>> between the two.
>>>
>>> That is not what the RDD model is. RDDs are first-class,
>>> stand-alone (distributed, immutable) datasets. Once created, an RDD exists
>>> on its own and isn't expected to somehow automatically realize that some
>>> underlying source has changed. (There is the concept of lineage or
>>> provenance for recomputation of RDDs, but that's orthogonal to this
>>> interpretation, so I won't muddy the issue here.)
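>>>
>>> As a rough illustration in spark-shell (a minimal sketch; the values are
>>> arbitrary), transformations never modify an existing RDD, they only define
>>> new ones:
>>>
>>> scala> val rdd1 = sc.parallelize(Seq(1, 2, 3))
>>> scala> val rdd2 = rdd1.map(_ * 10)   // defines a new RDD; rdd1 is untouched
>>> scala> rdd1.collect()                // Array(1, 2, 3)
>>> scala> rdd2.collect()                // Array(10, 20, 30)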
>>>
>>> If you're looking for a mutable data table model, we will soon be
>>> releasing to open source something called Distributed DataFrame (DDF, based
>>> on R's data.frame) on top of RDDs that allows you to, among other useful
>>> things, load a dataset, perform transformations on it, and save it back,
>>> all the while holding on to a single DDF reference.
>>>
>>> --
>>> Christopher T. Nguyen
>>> Co-founder & CEO, Adatao <http://adatao.com>
>>> linkedin.com/in/ctnguyen
>>>
>>>
>>>
>>> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <[email protected]> wrote:
>>>
>>>> Thanks Patrick, but I think I didn't put my question clearly...
>>>>
>>>> The question is: say in the native file system or HDFS, I have data
>>>> describing students who passed, failed, or for whom results are withheld
>>>> for some reason.
>>>> *Time T1:*
>>>> x - Pass
>>>> y - Fail
>>>> z - Withheld.
>>>>
>>>> *Time T2:*
>>>> So I create an RDD1 reflecting this data and run a query to find how many
>>>> candidates have passed.
>>>> RESULT = 1. RDD1 is cached, or it's stored in the file system, depending
>>>> on the availability of space.
>>>>
>>>> *Time T3:*
>>>> In the native file system, the results of z are now out and z is
>>>> declared passed, so HDFS will need to be modified:
>>>> x - Pass
>>>> y - Fail
>>>> z - Pass.
>>>> Say now I take the RDD1 that is in the file system, or the cached copy, and
>>>> run the same query: I get RESULT = 1, but ideally RESULT should be 2.
>>>>
>>>> So I was asking: is there a way Spark hints that RDD1 is no longer
>>>> consistent with the file system, or is it up to the programmer to recreate
>>>> RDD1 if the block from which the RDD was created was changed at a later
>>>> point in time?
>>>> [T1 < T2 < T3 < T4]
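>>>>
>>>> In code terms, roughly (a sketch; the file name results.txt and the
>>>> one-record-per-line format are just assumptions):
>>>>
>>>> scala> val rdd1 = sc.textFile("results.txt").cache()
>>>> scala> rdd1.filter(_.contains("Pass")).count()   // 1 at T2
>>>> scala> // results.txt is updated at T3 so that z is now Pass
>>>> scala> rdd1.filter(_.contains("Pass")).count()   // still 1 at T4 if the cached copy is reused, though ideally 2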
>>>>
>>>> Thanks in advance...
>>>>
>>>>
>>>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <[email protected]> wrote:
>>>>
>>>>> RDDs are immutable, so there isn't really such a thing as modifying a
>>>>> block in-place inside of an RDD. As a result, this particular
>>>>> consistency issue doesn't come up in Spark.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <[email protected]> wrote:
>>>>> > Hello, I am a novice to Spark.
>>>>> >
>>>>> > Say that we have created an RDD1 from the native file system/HDFS and
>>>>> > done some transformations and actions, and that resulted in an RDD2.
>>>>> > Let's assume RDD1 and RDD2 are persisted, cached in-memory. If the
>>>>> > block from which RDD1 was created was modified at time T1 and
>>>>> > RDD1/RDD2 is accessed later at T2 > T1, is there a way either Spark
>>>>> > ensures consistency or it is up to the programmer to make it explicit?
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > View this message in context:
>>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Sai Prasanna. AN*
>>>> *II M.Tech (CS), SSSIHL*
>>>>
>>>>
>>>> * Entire water in the ocean can never sink a ship, Unless it gets
>>>> inside. All the pressures of life can never hurt you, Unless you let them
>>>> in.*
>>>>
>>>
>>>
>>
>
