Here are the threads that describe the problems we're experiencing. These problems are exacerbated when we use cache/persist:
https://www.mail-archive.com/user@spark.apache.org/msg64987.html
https://www.mail-archive.com/user@spark.apache.org/msg64986.html

So I am looking for a way to reproduce the same effect as in my sample code without the use of cache(). If I use myrdd.count(), would that be a good alternative?

thanks

________________________________
From: lucas.g...@gmail.com <lucas.g...@gmail.com>
Sent: Tuesday, August 1, 2017 11:23:04 AM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache

Hi Jeff, that looks sane to me. Do you have additional details?

On 1 August 2017 at 11:05, jeff saremi <jeffsar...@hotmail.com> wrote:

Calling cache/persist fails all our jobs (I have posted 2 threads on this), and we're giving up hope of finding a solution. So I'd like to find a workaround: if I save an RDD to HDFS and read it back, can I use it in more than one operation?

Example (using cache):

// do a whole bunch of transformations on an RDD
myrdd.cache()
val result1 = myrdd.map(op1(_))
val result2 = myrdd.map(op2(_))

// in the above I am assuming that a call to cache will prevent all
// previous transformations from being calculated twice

I'd like to somehow get result1 and result2 without duplicating work. How can I do that?

thanks
Jeff
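One caveat on the count() idea: count() forces the RDD to be evaluated, but without cache/persist the computed partitions are discarded afterwards, so a later action would still recompute the lineage. The save-to-HDFS-and-read-back workaround asked about above avoids that, because the re-read RDD's lineage starts at the HDFS files. A minimal sketch of that workaround, assuming a SparkContext named sc and placeholder functions op1/op2 and paths (all hypothetical, not from the thread):

```scala
// Materialize the RDD to HDFS once, then read it back so that both
// downstream operations reuse the saved data instead of recomputing
// the original transformations. Paths and op1/op2 are assumptions.

// saveAsObjectFile serializes each partition using Java serialization
// and writes it as a SequenceFile; this triggers the full computation once.
myrdd.saveAsObjectFile("hdfs:///tmp/myrdd-materialized")

// objectFile reads the serialized records back; the lineage of `saved`
// begins at these HDFS files, so neither map below recomputes myrdd.
val saved = sc.objectFile[String]("hdfs:///tmp/myrdd-materialized")

val result1 = saved.map(op1(_))
val result2 = saved.map(op2(_))
```

The trade-off versus cache() is an extra serialize/write and two reads from HDFS, but no executor memory pressure; checkpoint() with a configured checkpoint directory is a related built-in mechanism, though it only truncates lineage rather than replacing cache semantics exactly.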