Thanks, Sean! Yes, I agree that this logging would still have some cost and
so would not be used in production.
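For reference, the pattern discussed below can be sketched with a small helper. This `timed` function is hypothetical (not part of Spark's API), and the `rdd` in the usage comment is an assumed existing RDD:

```scala
// Hypothetical helper for timing a block of code, e.g. an action run
// purely to force RDD evaluation. Not part of Spark's API.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body  // e.g. rdd.foreachPartition(_ => ())
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}

// Usage sketch (assumes an existing RDD named rdd):
//   timed("component-1") { rdd.cache().foreachPartition(_ => ()) }
```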
On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen wrote:
> I think the cheapest possible way to force materialization is something
> like
>
> rdd.foreachPartition(i => None)
>
> I get the use case, but as you can see there is a cost: you are forced
> to materialize an RDD and cache it just to measure the computation
> time. In principle this could be taking significantly more time than
> not doing so, since otherwise several RDD stages might proceed without
> ever even having to persist intermediate results in memory.
>
> Consider looking at the Spark UI to see how much time a stage took,
> although it's measuring end to end wall clock time, which may overlap
> with other computations.
>
> (or maybe you are disabling / enabling this logging for prod / test anyway)
>
> On Sat, Feb 21, 2015 at 4:46 AM, pnpritchard wrote:
> > Is there a technique for forcing the evaluation of an RDD?
> >
> > I have used actions to do so but even the most basic "count" has a
> > non-negligible cost (even on a cached RDD, repeated calls to count take
> > time).
> >
> > My use case is for logging the execution time of the major components in my
> > application. At the end of each component I have a statement like
> > "rdd.cache().count()" and time how long it takes.
> >
> > Thanks in advance for any advice!
> > Nick
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Force-RDD-evaluation-tp21748.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>