You can also do something like:

  rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
    while (iter.hasNext) iter.next()
  })

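
For anyone who wants to try this outside the shell, here is a minimal sketch of the three options discussed in this thread (count(), a no-op foreachPartition, and runJob with an iterator that is simply drained). The object name MaterializeSketch, the helper materialize, the local[2] master and the toy data set are my own assumptions for illustration, not anything that ships with Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaterializeSketch {

  // Hypothetical helper (not part of Spark): runs a job that walks every
  // partition's iterator and discards the elements, so a cached RDD gets
  // computed and stored without building any per-partition result.
  def materialize[T](rdd: RDD[T]): Unit = {
    rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
      while (iter.hasNext) iter.next()
    })
    ()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch is self-contained.
    val conf = new SparkConf().setAppName("materialize-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, 4).map(_ * 2).cache()

      // Option 1: the usual idiom; also counts the elements.
      // rdd.count()

      // Option 2: the no-op action from the original question.
      // rdd.foreachPartition(_ => ())

      // Option 3: the runJob variant suggested above.
      materialize(rdd)

      // Later actions on rdd can now read the cached blocks.
      println(rdd.count())
    } finally {
      sc.stop()
    }
  }
}
```

Whether option 3 is measurably faster than count() will depend on the data. As noted below, caching the blocks seems to dominate either way.
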
On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen <[email protected]> wrote:
> Yeah, from an unscientific test, it looks like the time to cache the
> blocks still dominates. Saving the count is probably a win, but not
> big. Well, maybe good to know.
>
> On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch <[email protected]> wrote:
> > Theoretically your approach would require less overhead - i.e. a collect on
> > the driver is not required as the last step. But maybe the difference is
> > small and that particular path may or may not have been properly optimized
> > vs the count(). Do you have a biggish data set to compare the timings?
> >
> > 2015-01-30 14:42 GMT-08:00 Sean Owen <[email protected]>:
> >>
> >> So far, the canonical way to materialize an RDD just to make sure it's
> >> cached is to call count(). That's fine but incurs the overhead of
> >> actually counting the elements.
> >>
> >> However, rdd.foreachPartition(p => None) for example also seems to
> >> cause the RDD to be materialized, and is a no-op. Is that a better way
> >> to do it or am I not thinking of why it's insufficient?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]