Re: Force RDD evaluation

2015-02-23 Thread Nicholas Pritchard
Thanks, Sean! Yes, I agree that this logging would still have some cost and
so would not be used in production.

On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen wrote:

> I think the cheapest possible way to force materialization is something
> like
>
> rdd.foreachPartition(i => None)
>
> I get the use case, but as you can see there is a cost: you are forced
> to materialize an RDD and cache it just to measure the computation
> time. In principle this could be taking significantly more time than
> not doing so, since otherwise several RDD stages might proceed without
> ever even having to persist intermediate results in memory.
>
> Consider looking at the Spark UI to see how much time a stage took,
> although it's measuring end to end wall clock time, which may overlap
> with other computations.
>
> (or maybe you are disabling / enabling this logging for prod / test anyway)
>
> On Sat, Feb 21, 2015 at 4:46 AM, pnpritchard wrote:
> > Is there a technique for forcing the evaluation of an RDD?
> >
> > I have used actions to do so but even the most basic "count" has a
> > non-negligible cost (even on a cached RDD, repeated calls to count take
> > time).
> >
> > My use case is for logging the execution time of the major components in my
> > application. At the end of each component I have a statement like
> > "rdd.cache().count()" and time how long it takes.
> >
> > Thanks in advance for any advice!
> > Nick
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Force-RDD-evaluation-tp21748.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Force RDD evaluation

2015-02-21 Thread Sean Owen
I think the cheapest possible way to force materialization is something like

rdd.foreachPartition(i => None)

I get the use case, but as you can see there is a cost: you are forced
to materialize an RDD and cache it just to measure the computation
time. In principle this could be taking significantly more time than
not doing so, since otherwise several RDD stages might proceed without
ever even having to persist intermediate results in memory.

Consider looking at the Spark UI to see how much time a stage took,
although it's measuring end to end wall clock time, which may overlap
with other computations.

(or maybe you are disabling / enabling this logging for prod / test anyway)
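
A minimal sketch of the suggestion above, for anyone who wants to try it. The
object name ForceEval, the helper timedMaterialize and the app.timing system
property are all made up for illustration; none of them come from this thread
or from Spark. The sketch also drains each partition's iterator explicitly,
which is an addition here so that the upstream work runs whether or not the
RDD is cached. It assumes spark-core is on the classpath and runs in local
mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ForceEval {
  // Hypothetical switch (not a Spark setting): enable with -Dapp.timing=true
  // so the extra materialization cost is only paid in test runs.
  val timingEnabled: Boolean =
    sys.props.getOrElse("app.timing", "false").toBoolean

  /** Marks `rdd` as cached, forces every partition to be computed, and
    * returns the RDD with the elapsed wall-clock time in milliseconds.
    * No result is sent back to the driver, unlike count(). */
  def timedMaterialize[T](rdd: RDD[T]): (RDD[T], Double) = {
    val start = System.nanoTime()
    val cached = rdd.cache()
    // Walk each partition's iterator and discard the elements.
    cached.foreachPartition { it => while (it.hasNext) it.next() }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (cached, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("force-eval-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
      if (timingEnabled) {
        val (_, ms) = timedMaterialize(squares)
        println(s"squares materialized in $ms ms")
      }
      // Downstream work proceeds the same way with or without the timing step.
      println(squares.count())
    } finally {
      sc.stop()
    }
  }
}
```

With the switch off, nothing is cached or forced, so the stages can pipeline
as usual; with it on, the measured time is still end-to-end wall clock time
for everything upstream that had not already been computed.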

On Sat, Feb 21, 2015 at 4:46 AM, pnpritchard wrote:
> Is there a technique for forcing the evaluation of an RDD?
>
> I have used actions to do so but even the most basic "count" has a
> non-negligible cost (even on a cached RDD, repeated calls to count take
> time).
>
> My use case is for logging the execution time of the major components in my
> application. At the end of each component I have a statement like
> "rdd.cache().count()" and time how long it takes.
>
> Thanks in advance for any advice!
> Nick



Force RDD evaluation

2015-02-20 Thread pnpritchard
Is there a technique for forcing the evaluation of an RDD?

I have used actions to do so but even the most basic "count" has a
non-negligible cost (even on a cached RDD, repeated calls to count take
time).

My use case is for logging the execution time of the major components in my
application. At the end of each component I have a statement like
"rdd.cache().count()" and time how long it takes.

Thanks in advance for any advice!
Nick


