Hi hequn, a related question: does that mean the memory usage will double? Furthermore, if the compute function of an RDD is not idempotent, the RDD's contents can change while the job is running, is that right?
-----Original Message-----
From: "hequn cheng" <chenghe...@gmail.com>
Sent: 2014/3/25 9:35
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: RDD usage

points.foreach(p=>p.y = another_value) will return a new modified RDD.

2014-03-24 18:13 GMT+08:00 Chieh-Yen <r01944...@csie.ntu.edu.tw>:

Dear all,

I have a question about the usage of RDDs. I implemented a class called AppDataPoint, which looks like this:

    case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable {
      var y : Double = input_y
      var x : Array[Double] = input_x
      ......
    }

Furthermore, I created the RDD with the help of the following function:

    def parsePoint(line: String): AppDataPoint = {
      /* Some related works for parsing */
      ......
    }

The RDD is called "points":

    val lines = sc.textFile(inputPath, numPartition)
    var points = lines.map(parsePoint _).cache()

My question is this: I tried to modify the values in this RDD with the following operation:

    points.foreach(p=>p.y = another_value)

The operation works. The system shows no warning or error message, and the results are correct. I wonder whether modifying an RDD this way is a correct and workable design. The documentation says that an RDD is immutable; are there any suggestions?

Thanks a lot.

Chieh-Yen Lin
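A note on the pattern discussed in this thread, as a hedged sketch rather than a definitive answer. In Spark, foreach is an action that returns Unit, while map is the transformation that returns a new RDD. Mutating objects inside foreach only changes whatever copies currently sit in the cache; if a partition is evicted or lost and recomputed from its lineage, those changes are likely to disappear. An approach that fits the immutable model is to build new objects in a map. The sketch below assumes a simplified, immutable AppDataPoint (no var fields) and a hypothetical helper named relabel; a local Seq stands in for the RDD so the snippet runs without a Spark installation.

```scala
// Simplified, immutable variant of the AppDataPoint class from the
// original post (an assumption: the var fields are dropped).
final case class AppDataPoint(y: Double, x: Array[Double])

object RelabelSketch {
  // Hypothetical helper: returns a new point with a different label.
  // The input point is left untouched; the feature array is shared,
  // not copied, so only the small wrapper object is duplicated.
  def relabel(p: AppDataPoint, newY: Double): AppDataPoint =
    p.copy(y = newY)

  def main(args: Array[String]): Unit = {
    // A local Seq stands in for the RDD here. With Spark, the same
    // idea would be written as points.map(p => relabel(p, 0.0)).
    val points = Seq(
      AppDataPoint(1.0, Array(0.5, 2.0)),
      AppDataPoint(-1.0, Array(1.5, 0.1))
    )

    // Build a new collection instead of mutating the old one.
    val relabeled = points.map(p => relabel(p, 0.0))

    println(points.map(_.y))    // original labels are unchanged
    println(relabeled.map(_.y)) // new labels live in the new collection
  }
}
```

Because copy reuses the existing x array, the extra memory in this sketch is one small wrapper object per point rather than a second copy of the features. Whether the overall footprint doubles in a real job would depend on whether both the old and the new RDD are kept cached.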