Re: Cheapest way to materialize an RDD?
You can also do something like:

    rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => { while (iter.hasNext) iter.next() })

On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen wrote:
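Spark isn't available in a self-contained snippet, so the following pure-Scala sketch (hypothetical names, no SparkContext) only mimics what the runJob call above does: Spark hands each partition's Iterator[T] to the supplied function, and draining the iterator forces every element to be computed without collecting anything back to the driver.

```scala
// Pure-Scala sketch of the per-partition no-op above. In real Spark, runJob
// would invoke drain once per partition on the executors; here we simulate
// partitions with lazily evaluated iterators.
object DrainSketch {
  // The function passed to runJob: consume the iterator, return nothing useful.
  def drain[T](iter: Iterator[T]): Unit =
    while (iter.hasNext) iter.next()

  def main(args: Array[String]): Unit = {
    var computed = 0
    // Two "partitions" of a lazily computed dataset: Iterator.tabulate only
    // runs its body when next() is called, much like an RDD transformation
    // only runs when an action consumes it.
    val partitions = Seq(
      Iterator.tabulate(3) { i => computed += 1; i },
      Iterator.tabulate(4) { i => computed += 1; i }
    )
    partitions.foreach(drain) // analogous to what runJob does on each executor
    println(computed)         // all 7 elements were forced
  }
}
```

Draining rather than counting avoids building any result to return, which is the point of the runJob variant.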
Re: Cheapest way to materialize an RDD?
Yeah, from an unscientific test, it looks like the time to cache the blocks still dominates. Saving the count is probably a win, but not a big one. Well, maybe good to know.

On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch wrote:
Re: Cheapest way to materialize an RDD?
Theoretically your approach would require less overhead - i.e. a collect on the driver is not required as the last step. But maybe the difference is small, and that particular path may or may not have been as well optimized as count(). Do you have a biggish data set to compare the timings?

2015-01-30 14:42 GMT-08:00 Sean Owen:
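A minimal harness for the timing comparison suggested here might look like the following. `Timing.time` is a hypothetical helper; the Spark calls in the comment assume a large cached RDD and a live SparkContext, neither of which is shown.

```scala
// Hypothetical micro-timing helper for comparing materialization strategies.
// In a real comparison you would wrap Spark actions on a cached RDD, e.g.:
//   Timing.time("count")(bigRdd.count())
//   Timing.time("touch")(bigRdd.foreachPartition(_ => ()))
object Timing {
  def time[A](label: String)(body: => A): A = {
    val t0 = System.nanoTime()
    val result = body // run the action being measured
    val secs = (System.nanoTime() - t0) / 1e9
    println(f"$label%s took $secs%.3f s")
    result
  }
}
```

For a fair test, each action should run against a freshly unpersisted copy of the data, since the second action over an already-cached RDD measures something different.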
Cheapest way to materialize an RDD?
So far, the canonical way to materialize an RDD just to make sure it's cached is to call count(). That's fine, but it incurs the overhead of actually counting the elements.

However, rdd.foreachPartition(p => None), for example, also seems to cause the RDD to be materialized, and is a no-op. Is that a better way to do it, or am I not thinking of why it's insufficient?
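The trade-off described here can be sketched without a cluster. This hypothetical pure-Scala analogue (no Spark involved) shows that both actions must visit every element to force a lazy computation; count() merely adds a running tally that is returned to the caller, which stands in for the driver.

```scala
// Pure-Scala analogue of the two materialization actions being compared.
object MaterializeWays {
  // Analog of rdd.count(): traverses everything and returns a total.
  def countAll[T](iter: Iterator[T]): Long = {
    var n = 0L
    while (iter.hasNext) { iter.next(); n += 1 }
    n
  }

  // Analog of rdd.foreachPartition(p => None): traverses, returns nothing.
  def touchAll[T](iter: Iterator[T]): Unit =
    while (iter.hasNext) iter.next()

  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // Fresh lazy iterator per call; the body runs only when next() is called.
    def lazyData = Iterator.tabulate(10) { i => evaluated += 1; i }

    val n = countAll(lazyData)
    println(s"count saw $n elements")
    touchAll(lazyData)
    println(s"evaluated $evaluated times") // both actions forced every element
  }
}
```

Either way the full traversal happens; the no-op variant just skips the tally and the tiny result sent back per partition.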