yes, it sounds like a good use of an accumulator to me val counts = sc.accumulator(0L) rdd.map{x => counts += 1 x }.saveAsObjectFile(file2)
On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > Sean > > > > Yes I know that I can use persist() to persist to disk, but it is still a > big extra cost of persist a huge RDD to disk. I hope that I can do one pass > to get the count as well as rdd.saveAsObjectFile(file2), but I don’t know > how. > > > > May be use accumulator to count the total ? > > > > Ningjun > > > > *From:* Mark Hamstra [mailto:m...@clearstorydata.com] > *Sent:* Thursday, March 26, 2015 12:37 PM > *To:* Sean Owen > *Cc:* Wang, Ningjun (LNG-NPV); user@spark.apache.org > *Subject:* Re: How to get rdd count() without double evaluation of the > RDD? > > > > You can also always take the more extreme approach of using > SparkContext#runJob (or submitJob) to write a custom Action that does what > you want in one pass. Usually that's not worth the extra effort. > > > > On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen <so...@cloudera.com> wrote: > > To avoid computing twice you need to persist the RDD but that need not be > in memory. You can persist to disk with persist(). > > On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" < > ningjun.w...@lexisnexis.com> wrote: > > I have a rdd that is expensive to compute. I want to save it as object > file and also print the count. How can I avoid double computation of the > RDD? > > > > val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) > > > > val count = rdd.count() // this force computation of the rdd > > println(count) > > rdd.saveAsObjectFile(file2) // this compute the RDD again > > > > I can avoid double computation by using cache > > > > val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) > > rdd.cache() > > val count = rdd.count() > > println(count) > > rdd.saveAsObjectFile(file2) // this compute the RDD again > > > > This only compute rdd once. However the rdd has millions of items and will > cause out of memory. > > > > Question: how can I avoid double computation without using cache? > > > > > > Ningjun > > >