Re: Consistency between RDD's and Native File System

Sai Prasanna Fri, 17 Jan 2014 08:26:59 -0800

Christopher, things are much clearer now. I did go through journaling.

Thanks...



On Fri, Jan 17, 2014 at 8:59 PM, Christopher Nguyen <[email protected]> wrote:

> Sai, to be sure, what Mark said regarding lineage and recomputation is
> exactly correct, so if it matters in your use case, you shouldn't ignore
> this behavior, even as a side effect.
>
> It just isn't what I think you were expecting in terms of RDD guarantees,
> e.g., somehow there is a signal sent to your driver or workers that the
> "original source" has changed. Further, there are no guarantees that Spark
> hasn't decided to checkpoint the lineage somewhere and is no longer going
> back to the "original source" to pick up the latest data. The recomputation
> (read "journaling") design goal is reliability, not "data refresh".
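>
> A rough sketch of the checkpointing case, assuming a spark-shell session
> (where `sc` is predefined). The checkpoint directory and the file name
> students.txt are hypothetical. Once the checkpoint has been written, the
> lineage is truncated, so any recomputation reads the checkpointed data
> rather than the original file:
>
> ```scala
> // Hypothetical directory; on a cluster this should live on HDFS or a
> // similar shared, reliable file system.
> sc.setCheckpointDir("/tmp/spark-checkpoints")
>
> val results = sc.textFile("students.txt")
> results.checkpoint()   // must be requested before the first action
>
> // The first action computes the RDD and writes the checkpoint.
> println(results.count())
>
> // From here on the lineage points at the checkpoint files, so later
> // edits to students.txt are not reflected in `results`.
> println(results.isCheckpointed)
> println(results.toDebugString)
> ```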
>
> Hope that is clear. I do sympathize with a possible reading of your design
> goal; we are working on perhaps a similar design goal where streaming data
> deltas are automatically reflected into a data structure on which the user
> has a single reference (*)
>
> (*) yep this is based on DStream/TD's work and will be available soon.
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
>
> On Thu, Jan 16, 2014 at 9:33 PM, Christopher Nguyen <[email protected]> wrote:
>
>> Mark, that's precisely why I brought up lineage, in order to say I didn't
>> want to muddy the issue there :)
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>>
>> On Thu, Jan 16, 2014 at 9:09 PM, Mark Hamstra <[email protected]> wrote:
>>
>>> I don't agree entirely, Christopher.  Without persisting or
>>> checkpointing RDDs, re-evaluation of the lineage will pick up source
>>> changes.  I'm not saying that working this way is a good idea (in fact,
>>> it's generally not), but you can do things like this:
>>>
>>> 1) Create file silliness.txt containing:
>>>
>>> one line
>>> two line
>>> red line
>>> blue line
>>>
>>> 2) Fire up spark-shell and do this:
>>>
>>> scala> val lines = sc.textFile("silliness.txt")
>>> scala> println(lines.collect.mkString(", "))
>>> .
>>> .
>>> .
>>> one line, two line, red line, blue line
>>>
>>> 3) Edit silliness.txt so that it is now:
>>>
>>> and now
>>> for something
>>> completely
>>> different
>>>
>>> 4) Continue on with spark-shell:
>>>
>>> scala> println(lines.collect.mkString(", "))
>>> .
>>> .
>>> .
>>> and now, for something, completely, different
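>>>
>>> For contrast, a minimal sketch of the persisted case, assuming the same
>>> spark-shell session (so `sc` is predefined) and a silliness.txt restored
>>> to its step 1 contents. `cachedLines` is an illustrative name. Once
>>> cache() has been requested and an action has materialized the
>>> partitions, a later collect is normally served from the cached copy
>>> rather than by re-reading the file. Spark may still recompute evicted or
>>> lost partitions from the source, so neither outcome is guaranteed:
>>>
>>> ```scala
>>> // Assumes spark-shell, where `sc` (the SparkContext) already exists.
>>> val cachedLines = sc.textFile("silliness.txt").cache()
>>>
>>> // First action: reads the file and keeps the partitions in memory.
>>> println(cachedLines.collect().mkString(", "))
>>>
>>> // ... edit silliness.txt on disk here ...
>>>
>>> // Second action: typically answered from the cache, so the edit is
>>> // usually not visible. Partitions that were evicted would be
>>> // recomputed from the file and could pick up the new contents.
>>> println(cachedLines.collect().mkString(", "))
>>> ```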
>>>
>>>
>>> On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen <[email protected]> wrote:
>>>
>>>> Sai, from your question, I infer that you have an interpretation that
>>>> RDDs are somehow an in-memory/cached copy of the underlying data
>>>> source---and so there is some expectation that there is some
>>>> synchronization model between the two.
>>>>
>>>> That would not be what the RDD model is. RDDs are first-class,
>>>> stand-alone (distributed, immutable) datasets. Once created, an RDD exists
>>>> on its own and isn't expected to somehow automatically realize that some
>>>> underlying source has changed. (There is the concept of lineage or
>>>> provenance for recomputation of RDDs, but that's orthogonal to this
>>>> interpretation so I won't muddy the issue here).
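>>>>
>>>> To make that concrete, a minimal sketch assuming a spark-shell session
>>>> (`sc` predefined) and a hypothetical students.txt holding one
>>>> "name - status" record per line. `countPassed` is an illustrative
>>>> helper, not a Spark API. Each call builds a new RDD from the file, so
>>>> the count reflects whatever is on disk when the action runs:
>>>>
>>>> ```scala
>>>> // Assumes spark-shell; students.txt is hypothetical, with lines such
>>>> // as "x - Pass" or "z - With-held.".
>>>> def countPassed(path: String): Long =
>>>>   sc.textFile(path)
>>>>     .map(_.trim.stripSuffix("."))   // tolerate a trailing period
>>>>     .filter(_.endsWith("Pass"))
>>>>     .count()
>>>>
>>>> val before = countPassed("students.txt")
>>>>
>>>> // ... students.txt is rewritten by some outside process ...
>>>>
>>>> // Nothing notifies the earlier RDD; the programmer asks again, which
>>>> // creates a new RDD and reads the file again.
>>>> val after = countPassed("students.txt")
>>>> ```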
>>>>
>>>> If you're looking for a mutable data table model, we will soon be
>>>> releasing to open source something called Distributed DataFrame (DDF, based
>>>> on R's data.frame) on top of RDDs that allows you to, among other useful
>>>> things, load a dataset, perform transformations on it, and save it back,
>>>> all the while holding on to a single DDF reference.
>>>>
>>>> --
>>>> Christopher T. Nguyen
>>>> Co-founder & CEO, Adatao <http://adatao.com>
>>>> linkedin.com/in/ctnguyen
>>>>
>>>>
>>>>
>>>> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <[email protected]> wrote:
>>>>
>>>>> Thanks Patrick, but I think I didn't put my question clearly...
>>>>>
>>>>> The question is this. Say that in the native file system or HDFS I have
>>>>> data describing students who passed, who failed, and whose results are
>>>>> withheld for some reason.
>>>>> *Time T1:*
>>>>> x - Pass
>>>>> y - Fail
>>>>> z - With-held.
>>>>>
>>>>> *Time T2:*
>>>>> So I create RDD1 reflecting this data and run a query to find how many
>>>>> candidates have passed.
>>>>> RESULT = 1. RDD1 is cached, or it is stored in the file system,
>>>>> depending on the availability of space.
>>>>>
>>>>> *Time T3:*
>>>>> In the native file system, the results for z are now out and z is
>>>>> declared passed, so the data in HDFS has to be modified.
>>>>> x - Pass
>>>>> y - Fail
>>>>> z - Pass.
>>>>> Say I now take RDD1, either the copy in the file system or the cached
>>>>> copy, and run the same query. I get RESULT = 1, but ideally RESULT is 2.
>>>>>
>>>>> So I was asking: is there a way Spark hints that RDD1 is no longer
>>>>> consistent with the file system, or is it up to the programmer to
>>>>> recreate RDD1 if the block from which the RDD was created was changed
>>>>> at a later point in time?
>>>>> [T1 < T2 < T3 < T4]
>>>>>
>>>>> Thanks in advance...
>>>>>
>>>>>
>>>>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <[email protected]> wrote:
>>>>>
>>>>>> RDDs are immutable, so there isn't really such a thing as modifying a
>>>>>> block in-place inside of an RDD. As a result, this particular
>>>>>> consistency issue doesn't come up in Spark.
>>>>>>
>>>>>> - Patrick
>>>>>>
>>>>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <[email protected]> wrote:
>>>>>> > Hello, I am a novice to Spark.
>>>>>> >
>>>>>> > Say that we have created an RDD1 from the native file system/HDFS and
>>>>>> > done some transformations and actions that resulted in an RDD2. Let's
>>>>>> > assume RDD1 and RDD2 are persisted, cached in-memory. If the block
>>>>>> > from which RDD1 was created was modified at time T1 and RDD1/RDD2 is
>>>>>> > accessed later at T2 > T1, is there a way Spark ensures consistency,
>>>>>> > or is it up to the programmer to make it explicit?
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > View this message in context:
>>>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Sai Prasanna. AN*
>>>>> *II M.Tech (CS), SSSIHL*
>>>>>
>>>>>
>>>>> * Entire water in the ocean can never sink a ship, Unless it gets
>>>>> inside. All the pressures of life can never hurt you, Unless you let them
>>>>> in.*
>>>>>
>>>>
>>>>
>>>
>>
>


-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside. All
the pressures of life can never hurt you, Unless you let them in.*
