Re: How to get rdd count() without double evaluation of the RDD?

Imran Rashid Mon, 13 Apr 2015 19:31:14 -0700

yes, it sounds like a good use of an accumulator to me

val counts = sc.accumulator(0L)
rdd.map{x =>
  counts += 1
  x
}.saveAsObjectFile(file2)



On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Sean
>
>
>
> Yes I know that I can use persist() to persist to disk, but it is still a
> big extra cost of persist a huge RDD to disk. I hope that I can do one pass
> to get the count as well as rdd.saveAsObjectFile(file2), but I don’t know
> how.
>
>
>
> May be use accumulator to count the total ?
>
>
>
> Ningjun
>
>
>
> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
> *Sent:* Thursday, March 26, 2015 12:37 PM
> *To:* Sean Owen
> *Cc:* Wang, Ningjun (LNG-NPV); user@spark.apache.org
> *Subject:* Re: How to get rdd count() without double evaluation of the
> RDD?
>
>
>
> You can also always take the more extreme approach of using
> SparkContext#runJob (or submitJob) to write a custom Action that does what
> you want in one pass.  Usually that's not worth the extra effort.
>
>
>
> On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen <so...@cloudera.com> wrote:
>
> To avoid computing twice you need to persist the RDD but that need not be
> in memory. You can persist to disk with persist().
>
> On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" <
> ningjun.w...@lexisnexis.com> wrote:
>
> I have a rdd that is expensive to compute. I want to save it as object
> file and also print the count. How can I avoid double computation of the
> RDD?
>
>
>
> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>
>
>
> val count = rdd.count()  // this force computation of the rdd
>
> println(count)
>
> rdd.saveAsObjectFile(file2) // this compute the RDD again
>
>
>
> I can avoid double computation by using cache
>
>
>
> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>
> rdd.cache()
>
> val count = rdd.count()
>
> println(count)
>
> rdd.saveAsObjectFile(file2) // this compute the RDD again
>
>
>
> This only compute rdd once. However the rdd has millions of items and will
> cause out of memory.
>
>
>
> Question: how can I avoid double computation without using cache?
>
>
>
>
>
> Ningjun
>
>
>

Re: How to get rdd count() without double evaluation of the RDD?

Reply via email to