Wow, that clears it up. Thanks, Christopher!!

On Fri, Jan 17, 2014 at 9:23 AM, Christopher Nguyen wrote:

> Sai, from your question, I infer that you interpret RDDs as somehow being
> an in-memory/cached copy of the underlying data source, and so you expect
> some synchronization model between the two.
>
> That is not what the RDD model is. RDDs are first-class, stand-alone
> (distributed, immutable) datasets. Once created, an RDD exists on its own
> and isn't expected to somehow automatically realize that some underlying
> source has changed. (There is the concept of lineage or provenance for
> recomputation of RDDs, but that's orthogonal to this interpretation so I
> won't muddy the issue here.)
>
> If you're looking for a mutable data table model, we will soon be
> releasing to open source something called Distributed DataFrame (DDF, based
> on R's data.frame) on top of RDDs that allows you to, among other useful
> things, load a dataset, perform transformations on it, and save it back,
> all the while holding on to a single DDF reference.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna wrote:
>
>> Thanks Patrick, but I think I didn't put my question clearly...
>>
>> The question is: say that in the native file system or HDFS I have data
>> describing students who passed, failed, and whose results are withheld
>> for some reason.
>>
>> *Time T1:*
>> x - Pass
>> y - Fail
>> z - With-held.
>>
>> *Time T2:*
>> I create an RDD1 reflecting this data and run a query to find how many
>> candidates have passed.
>> RESULT = 1. RDD1 is cached, or it's stored in the file system, depending
>> on the availability of space.
>>
>> *Time T3:*
>> In the native file system, the results for z are now out and z is
>> declared passed, so the data in HDFS needs to be modified.
>> x - Pass
>> y - Fail
>> z - Pass.
>>
>> Say I now take the RDD1 that is in the file system or the cached copy
>> and run the same query. I get RESULT = 1, but ideally RESULT is 2.
>>
>> So I was asking: is there a way Spark hints that RDD1 is no longer
>> consistent with the file system, or is it up to the programmer to
>> recreate RDD1 if the block from which the RDD was created was changed at
>> a later point in time?
>> [T1 < T2 < T3 < T4]
>>
>> Thanks in advance...
>>
>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell wrote:
>>
>>> RDDs are immutable, so there isn't really such a thing as modifying a
>>> block in-place inside of an RDD. As a result, this particular
>>> consistency issue doesn't come up in Spark.
>>>
>>> - Patrick
>>>
>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna wrote:
>>> > Hello, I am a novice to Spark.
>>> >
>>> > Say that we have created an RDD1 from the native file system/HDFS and
>>> > done some transformations and actions that resulted in an RDD2. Let's
>>> > assume RDD1 and RDD2 are persisted, cached in-memory. If the block
>>> > from which RDD1 was created was modified at time T1 and RDD1/RDD2 is
>>> > accessed later at T2 > T1, is there a way Spark ensures consistency,
>>> > or is it up to the programmer to make it explicit?
>>> >
>>> > --
>>> > View this message in context:
>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> --
>> *Sai Prasanna. AN*
>> *II M.Tech (CS), SSSIHL*
>>
>> *Entire water in the ocean can never sink a ship, Unless it gets inside.
>> All the pressures of life can never hurt you, Unless you let them in.*

--
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*

*Entire water in the ocean can never sink a ship, Unless it gets inside.
All the pressures of life can never hurt you, Unless you let them in.*
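
For readers following the T1 to T4 scenario in this thread, here is a minimal toy sketch in plain Python. It is not Spark code and does not use any Spark API: `ToyCachedDataset`, `passed`, and `run_demo` are hypothetical names invented for illustration, and the class only models the idea of a cached, immutable dataset that does not track changes to the file it was created from.

```python
import os
import tempfile


class ToyCachedDataset:
    """Hypothetical stand-in for a cached RDD: an immutable snapshot of a file."""

    def __init__(self, path):
        with open(path) as f:
            # Freeze the contents at creation time; a tuple cannot be
            # modified in place, and the file is never consulted again.
            self._rows = tuple(line.strip() for line in f if line.strip())

    def count_where(self, predicate):
        # Every query runs against the snapshot, not against the file.
        return sum(1 for row in self._rows if predicate(row))


def passed(row):
    return row.endswith("Pass")


def run_demo():
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # T1: the source file holds the initial results.
        with open(path, "w") as f:
            f.write("x - Pass\ny - Fail\nz - With-held\n")

        # T2: create the dataset and run the query.
        rdd1 = ToyCachedDataset(path)
        at_t2 = rdd1.count_where(passed)

        # T3: the source file is rewritten; z has now passed.
        with open(path, "w") as f:
            f.write("x - Pass\ny - Fail\nz - Pass\n")

        # T4: the old snapshot still answers with the old data, while a
        # dataset created again from the file sees the update.
        stale = rdd1.count_where(passed)
        fresh = ToyCachedDataset(path).count_where(passed)
        return at_t2, stale, fresh
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(run_demo())
```

In this toy model, the snapshot keeps returning the T2 answer after the file changes, and only a newly created dataset reflects the update. That matches the takeaway from the replies above: it is up to the programmer to create a new dataset when the source changes.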
