I think there are a number of misconceptions here.

It is not necessary for the original node to come back in order to recreate the lost partition, and the lineage is not retrieved from neighboring nodes. The source data is re-read in the same way it was the first time the partition was computed. The caller does not need to do anything; Spark performs the recomputation. The point is that the creation of the partition is deterministic and so can be replayed anywhere.

Spark *can* replicate RDDs, optionally. Resilience of data stored on HDFS is up to HDFS and is transparent to Spark.

Spark will use the data locality information to try to schedule work next to the data, no matter what the replication factor. More replication potentially allows more options in scheduling tasks, I suppose, since the data is found on more nodes.
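To make the idea concrete, here is a toy sketch (plain Python, not Spark internals) of why a lost partition can be rebuilt anywhere: the partition is just the result of a deterministic source read plus a recorded chain of deterministic transformations, so any node can replay that chain. (In real Spark, optional RDD replication is requested with a replicated storage level such as `StorageLevel.MEMORY_ONLY_2` when persisting; the names below, like `read_source`, are hypothetical stand-ins.)

```python
# Toy model of lineage-based recomputation (illustrative only).

def read_source(partition_id):
    # Stand-in for re-reading an HDFS block; deterministic for a given id.
    return list(range(partition_id * 3, partition_id * 3 + 3))

# The "lineage": a recorded chain of deterministic transformations.
lineage = [lambda x: x * 2, lambda x: x + 1]

def compute_partition(partition_id):
    data = read_source(partition_id)
    for f in lineage:
        data = [f(x) for x in data]
    return data

# Initial computation, with results cached per partition.
cache = {pid: compute_partition(pid) for pid in range(3)}

# Simulate losing the node holding partition 1.
del cache[1]

# Any node can replay the lineage against the source to rebuild it;
# the original node never needs to come back.
cache[1] = compute_partition(1)
print(cache[1])  # -> [7, 9, 11]
```

Note that if `read_source` or any function in the lineage were nondeterministic, the recomputed partition could differ from the lost one, which is exactly why Spark relies on deterministic recomputation.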
On Fri, Feb 6, 2015 at 9:47 AM, Kartheek.R <kartheek.m...@gmail.com> wrote:
> Hi,
>
> I have this doubt: assume that an RDD is stored across multiple nodes and
> one of the nodes fails, so a partition is lost. Now, I know that when this
> node is back, it uses the lineage from its neighbours and recomputes that
> partition alone.
>
> 1) How does it get the source data (the original data, before applying any
> transformations) that is lost during the crash? Is it our responsibility to
> get back the source data before using the lineage? We have only the lineage
> stored on other nodes.
>
> 2) Suppose the underlying HDFS deploys replication factor = 3. We know that
> Spark doesn't replicate RDDs. When a partition is lost, is there a
> possibility to use the second copy of the original data stored in HDFS and
> generate the required partition using lineage from other nodes?
>
> 3) Does it make any difference to Spark if HDFS replicates its blocks more
> than once?
>
> Can someone please enlighten me on these fundamentals?
>
> Thank you