Mark, that's precisely why I brought up lineage: to say I didn't want to muddy the issue there :)
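[Editor's note: a minimal plain-Python sketch of the behavior Mark demonstrates below. The `LazyTextDataset` class is a hypothetical toy stand-in for an RDD of lines, not the Spark API: an unpersisted lazy dataset re-runs its "lineage" (re-reads the source file) on every evaluation, so edits to the file show up, while a persisted copy keeps its snapshot.]

```python
import os
import tempfile

class LazyTextDataset:
    """Toy stand-in for an RDD of lines: collect() re-reads the source
    unless the dataset has been persisted."""
    def __init__(self, path):
        self.path = path
        self._cached = None           # set only after persist()

    def collect(self):
        if self._cached is not None:  # persisted: "lineage" never re-runs
            return self._cached
        with open(self.path) as f:    # unpersisted: re-read the source
            return [line.rstrip("\n") for line in f]

    def persist(self):
        self._cached = self.collect() # snapshot the current file contents
        return self

path = os.path.join(tempfile.mkdtemp(), "silliness.txt")
with open(path, "w") as f:
    f.write("one line\ntwo line\nred line\nblue line\n")

lines = LazyTextDataset(path)                  # unpersisted
snapshot = LazyTextDataset(path).persist()     # persisted snapshot
print(", ".join(lines.collect()))   # one line, two line, red line, blue line

with open(path, "w") as f:                     # edit the file, as in step 3
    f.write("and now\nfor something\ncompletely\ndifferent\n")

print(", ".join(lines.collect()))      # and now, for something, completely, different
print(", ".join(snapshot.collect()))   # still the original four lines
```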
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Thu, Jan 16, 2014 at 9:09 PM, Mark Hamstra <[email protected]> wrote:

> I don't agree entirely, Christopher. Without persisting or checkpointing
> RDDs, re-evaluation of the lineage will pick up source changes. I'm not
> saying that working this way is a good idea (in fact, it's generally not),
> but you can do things like this:
>
> 1) Create file silliness.txt containing:
>
> one line
> two line
> red line
> blue line
>
> 2) Fire up spark-shell and do this:
>
> scala> val lines = sc.textFile("silliness.txt")
> scala> println(lines.collect.mkString(", "))
> ...
> one line, two line, red line, blue line
>
> 3) Edit silliness.txt so that it is now:
>
> and now
> for something
> completely
> different
>
> 4) Continue on with spark-shell:
>
> scala> println(lines.collect.mkString(", "))
> ...
> and now, for something, completely, different
>
>
> On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen <[email protected]> wrote:
>
>> Sai, from your question, I infer that you have an interpretation that
>> RDDs are somehow an in-memory/cached copy of the underlying data
>> source---and so there is some expectation that there is some
>> synchronization model between the two.
>>
>> That would not be what the RDD model is. RDDs are first-class,
>> stand-alone (distributed, immutable) datasets. Once created, an RDD exists
>> on its own and isn't expected to somehow automatically realize that some
>> underlying source has changed. (There is the concept of lineage or
>> provenance for recomputation of RDDs, but that's orthogonal to this
>> interpretation, so I won't muddy the issue here.)
>>
>> If you're looking for a mutable data table model, we will soon be
>> releasing to open source something called Distributed DataFrame (DDF,
>> based on R's data.frame) on top of RDDs that allows you to, among other
>> useful things, load a dataset, perform transformations on it, and save it
>> back, all the while holding on to a single DDF reference.
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <[email protected]> wrote:
>>
>>> Thanks Patrick, but I think I didn't put my question clearly...
>>>
>>> The question is: say in the native file system or HDFS, I have data
>>> describing students who passed, failed, and for whom results are
>>> withheld for some reason.
>>>
>>> *Time T1:*
>>> x - Pass
>>> y - Fail
>>> z - Withheld
>>>
>>> *Time T2:*
>>> I create an RDD1 reflecting this data and run a query to find how many
>>> candidates have passed: RESULT = 1. RDD1 is cached or stored in the file
>>> system, depending on the availability of space.
>>>
>>> *Time T3:*
>>> In the native file system, the results of z are now out, and z is
>>> declared passed, so HDFS will need to be modified:
>>> x - Pass
>>> y - Fail
>>> z - Pass
>>>
>>> Say now I take the RDD1 that is in the file system (or the cached copy)
>>> and run the same query: I get RESULT = 1, but ideally RESULT = 2.
>>>
>>> So I was asking: is there a way Spark hints that RDD1 is no longer
>>> consistent with the file system, or is it up to the programmer to
>>> recreate RDD1 if the block from which the RDD was created was changed at
>>> a later point in time?
>>> [T1 < T2 < T3 < T4]
>>>
>>> Thanks in advance...
>>>
>>>
>>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <[email protected]> wrote:
>>>
>>>> RDDs are immutable, so there isn't really such a thing as modifying a
>>>> block in place inside of an RDD. As a result, this particular
>>>> consistency issue doesn't come up in Spark.
>>>>
>>>> - Patrick
>>>>
>>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <[email protected]> wrote:
>>>>
>>>> > Hello, I am a novice to Spark.
>>>> >
>>>> > Say that we have created an RDD1 from the native file system/HDFS and
>>>> > done some transformations and actions that resulted in an RDD2. Let's
>>>> > assume RDD1 and RDD2 are persisted, cached in memory. If the block
>>>> > from which RDD1 was created is modified at time T1, and RDD1/RDD2 is
>>>> > accessed later at T2 > T1, is there a way Spark ensures consistency,
>>>> > or is it up to the programmer to make it explicit?
>>>> >
>>>> >
>>>> > --
>>>> > View this message in context:
>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>> --
>>> *Sai Prasanna. AN*
>>> *II M.Tech (CS), SSSIHL*
>>>
>>> *Entire water in the ocean can never sink a ship, unless it gets
>>> inside. All the pressures of life can never hurt you, unless you let
>>> them in.*
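[Editor's note: Sai's pass/fail scenario in the same plain-Python toy terms (not Spark code). A result computed and cached at T2 keeps answering RESULT = 1 after the source file changes at T3; nothing invalidates it automatically, and the fix is for the programmer to explicitly rebuild from the source, just as the thread concludes. The file name and helper function are illustrative assumptions.]

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.txt")
with open(path, "w") as f:
    f.write("x - Pass\ny - Fail\nz - Withheld\n")    # state at T1

def count_passes(path):
    """Count students marked Pass by reading the source file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip().endswith("Pass"))

cached_result = count_passes(path)                   # T2: RESULT = 1, then cached

with open(path, "w") as f:
    f.write("x - Pass\ny - Fail\nz - Pass\n")        # T3: source is updated

print(cached_result)        # 1 -- the cached answer is now stale
print(count_passes(path))   # 2 -- re-reading the source is an explicit step
```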
