Christopher, things are much clearer now. I did go through Journaling. Thanks...

On Fri, Jan 17, 2014 at 8:59 PM, Christopher Nguyen <[email protected]> wrote:

> Sai, to be sure, what Mark said regarding lineage and recomputation is
> exactly correct, so if it matters in your use case, you shouldn't ignore
> this behavior, even as a side effect.
>
> It just isn't what I think you were expecting in terms of RDD guarantees,
> e.g., somehow there is a signal sent to your driver or workers that the
> "original source" has changed. Further, there are no guarantees that Spark
> hasn't decided to checkpoint the lineage somewhere and is no longer going
> back to the "original source" to pick up the latest data. The recomputation
> (read "journaling") design goal is reliability, not "data refresh".
>
> Hope that is clear. I do sympathize with a possible reading of your design
> goal; we are working on perhaps a similar design goal where streaming data
> deltas are automatically reflected into a data structure on which the user
> has a single reference (*)
>
> (*) yep this is based on DStream/TD's work and will be available soon.
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
> On Thu, Jan 16, 2014 at 9:33 PM, Christopher Nguyen <[email protected]>wrote:
>
>> Mark, that's precisely why I brought up lineage, in order to say I didn't
>> want to muddy the issue there :)
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>> On Thu, Jan 16, 2014 at 9:09 PM, Mark Hamstra <[email protected]>wrote:
>>
>>> I don't agree entirely, Christopher. Without persisting or
>>> checkpointing RDDs, re-evaluation of the lineage will pick up source
>>> changes. I'm not saying that working this way is a good idea (in fact,
>>> it's generally not), but you can do things like this:
>>>
>>> 1) Create file silliness.txt containing:
>>>
>>> one line
>>> two line
>>> red line
>>> blue line
>>>
>>> 2) Fire up spark-shell and do this:
>>>
>>> scala> val lines = sc.textFile("silliness.txt")
>>> scala> println(lines.collect.mkString(", "))
>>> .
>>> .
>>> .
>>> one line, two line, red line, blue line
>>>
>>> 3) Edit silliness.txt so that it is now:
>>>
>>> and now
>>> for something
>>> completely
>>> different
>>>
>>> 4) Continue on with spark-shell:
>>>
>>> scala> println(lines.collect.mkString(", "))
>>> .
>>> .
>>> .
>>> and now, for something, completely, different
>>>
>>> On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen <[email protected]>wrote:
>>>
>>>> Sai, from your question, I infer that you have an interpretation that
>>>> RDDs are somehow an in-memory/cached copy of the underlying data
>>>> source---and so there is some expectation that there is some
>>>> synchronization model between the two.
>>>>
>>>> That would not be what the RDD model is. RDDs are first-class,
>>>> stand-alone (distributed, immutable) datasets. Once created, an RDD exists
>>>> on its own and isn't expected to somehow automatically realize that some
>>>> underlying source has changed. (There is the concept of lineage or
>>>> provenance for recomputation of RDDs, but that's orthogonal to this
>>>> interpretation so I won't muddy the issue here).
>>>>
>>>> If you're looking for a mutable data table model, we will soon be
>>>> releasing to open source something called Distributed DataFrame (DDF, based
>>>> on R's data.frame) on top of RDDs that allows you to, among other useful
>>>> things, load a dataset, perform transformations on it, and save it back,
>>>> all the while holding on to a single DDF reference.
>>>>
>>>> --
>>>> Christopher T. Nguyen
>>>> Co-founder & CEO, Adatao <http://adatao.com>
>>>> linkedin.com/in/ctnguyen
>>>>
>>>> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <[email protected]>wrote:
>>>>
>>>>> Thanks Patrick, but i think i dint put my question clearly...
>>>>>
>>>>> The question is Say in the native file system or HDFS, i have data
>>>>> describing students who passed, failed and for whom results are with-held
>>>>> for some reason.
>>>>> *Time T1:*
>>>>> x - Pass
>>>>> y - Fail
>>>>> z - With-held.
>>>>>
>>>>> *Time T2:*
>>>>> So i create an RDD1 reflecting this data, run a query to find how many
>>>>> candidates have passed.
>>>>> RESULT = 1. RDD1 is cached or its stored in the file system depending
>>>>> on the availability of space.
>>>>>
>>>>> *Time T3:*
>>>>> In the native file system, now that results of the z are out and
>>>>> declared passed. So HDFS will need to be modified.
>>>>> x - Pass
>>>>> y - Fail
>>>>> z - Pass.
>>>>> Say now i get the RDD1 that is there in file system or cached copy and
>>>>> run the same query, i get the RESULT = 1, but ideally RESULT is 2.
>>>>>
>>>>> So i was asking is there a way SPARK hints that RDD1 is no longer
>>>>> consistent with the file system or that its upto the programmer to recreate
>>>>> the RDD1 if the block from where RDD was created was changed at a later
>>>>> point of time.
>>>>> [T1 < T2 < T3 < T4]
>>>>>
>>>>> Thanks in advance...
>>>>>
>>>>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <[email protected]>wrote:
>>>>>
>>>>>> RDD's are immutable, so there isn't really such a thing as modifying a
>>>>>> block in-place inside of an RDD. As a result, this particular
>>>>>> consistency issue doesn't come up in Spark.
>>>>>>
>>>>>> - Patrick
>>>>>>
>>>>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <[email protected]> wrote:
>>>>>> > Hello, i am a novice to SPARK
>>>>>> >
>>>>>> > Say that we have created an RDD1 from native file system/HDFS and done some
>>>>>> > transformations and actions and that resulted in an RDD2. Lets assume RDD1
>>>>>> > and RDD2 are persisted, cached in-memory. If the block from where RDD1 was
>>>>>> > created was modified at time T1 and RDD1/RDD2 is accessed later at T2 > T1,
>>>>>> > is there a way either SPARK ensures consistency or it is upto the programmer
>>>>>> > to make it explicit?
>>>>>> >
>>>>>> > --
>>>>>> > View this message in context:
>>>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>
>>>>>
>>>>> --
>>>>> *Sai Prasanna. AN*
>>>>> *II M.Tech (CS), SSSIHL*
>>>>>
>>>>> * Entire water in the ocean can never sink a ship, Unless it gets
>>>>> inside. All the pressures of life can never hurt you, Unless you let them
>>>>> in.*
>>>>>

--
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*

*Entire water in the ocean can never sink a ship, Unless it gets inside. All the pressures of life can never hurt you, Unless you let them in.*
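
The contrast discussed in this thread (re-evaluating the lineage versus reading a persisted copy) can be tried out with a small spark-shell session. What follows is only a minimal sketch under stated assumptions: it assumes spark-shell running in local mode with `sc` already defined and a writable working directory, and `writeLines` is a hypothetical helper introduced here for illustration, not something from the thread or from Spark. The file name and contents are borrowed from Mark's example.

```scala
// Sketch for spark-shell; `sc` (the SparkContext) is assumed to exist already.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val path = Paths.get("silliness.txt")

// Hypothetical helper (not part of Spark): overwrite the file with these lines.
def writeLines(ls: String*): Unit =
  Files.write(path, ls.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8))

writeLines("one line", "two line", "red line", "blue line")

// Not persisted: every action re-evaluates the lineage and re-reads the file.
val uncached = sc.textFile(path.toString)
val before = uncached.collect()

// Persisted: the first action materializes the partitions in memory.
val cached = sc.textFile(path.toString).cache()
val snapshot = cached.collect()

// Change the source underneath both RDDs.
writeLines("and now", "for something", "completely", "different")

// Recomputed from the source, so this should reflect the edited file.
val after = uncached.collect()

// Served from the cached partitions as long as none have been evicted.
// Spark sends no signal that the underlying file has changed.
val stale = cached.collect()

println(before.mkString(", "))
println(after.mkString(", "))
println(stale.mkString(", "))
```

In a small local run like this, one would expect `after` to show the edited file while `stale` still matches `snapshot`. Neither outcome is a guarantee: cached partitions can be evicted and recomputed from the source, which is Christopher's point that recomputation is there for reliability, not "data refresh". If the latest data is needed, the safer route is to create a new RDD from the source (or call `unpersist()` on the cached one before reusing it).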
