Re: Cheapest way to materialize an RDD?

2015-02-02 Thread Raghavendra Pandey
You can also do something like
rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
  // Drain the partition's iterator; this forces computation (and hence
  // caching, if the RDD is cached) without returning anything to the driver.
  while (iter.hasNext) iter.next()
})
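
A self-contained version of the same idea, as a rough sketch (the
SparkContext setup and names here are illustrative, not from the original
snippet):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("materialize").setMaster("local[*]")
val sc = new SparkContext(conf)
val data = sc.parallelize(1 to 1000000).cache()

// runJob submits one task per partition; each task drains its iterator
// and returns Unit, so no element data travels back to the driver.
// Draining works whether or not the RDD is cached, because it consumes
// every element explicitly.
sc.runJob(data, (iter: Iterator[Int]) => {
  while (iter.hasNext) iter.next()
})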

On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen  wrote:

> Yeah, from an unscientific test, it looks like the time to cache the
> blocks still dominates. Saving the count is probably a win, but not
> big. Well, maybe good to know.


Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving the count is probably a win, but not
big. Well, maybe good to know.
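
For anyone who wants to reproduce this, a rough sketch of that kind of
unscientific test (assumes a live SparkContext named sc; the sizes and
names are illustrative):

def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val result = f
  println(s"$label: ${(System.nanoTime() - t0) / 1e6} ms")
  result
}

// Build two identical cached RDDs so each is materialized exactly once.
val a = sc.parallelize(1 to 10000000).cache()
val b = sc.parallelize(1 to 10000000).cache()

time("materialize via count")(a.count())
time("materialize via no-op foreachPartition")(b.foreachPartition(_ => ()))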

On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch  wrote:
> Theoretically your approach would require less overhead, i.e. a collect on
> the driver is not required as the last step. But maybe the difference is
> small, and that particular path may or may not have been optimized as well
> as count(). Do you have a biggish data set to compare the timings?




Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Stephen Boesch
Theoretically your approach would require less overhead, i.e. a collect on
the driver is not required as the last step. But maybe the difference is
small, and that particular path may or may not have been optimized as well
as count(). Do you have a biggish data set to compare the timings?
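
To make the driver-side step concrete: count() boils down to roughly the
following (a paraphrase for illustration, not a verbatim copy of the Spark
source):

import org.apache.spark.rdd.RDD

// Each task returns its partition's size; the per-partition counts come
// back to the driver as an Array[Long], which is then summed there.
def countLike[T](rdd: RDD[T]): Long =
  rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
    var n = 0L
    while (iter.hasNext) { iter.next(); n += 1 }
    n
  }).sum

// A no-op action returns Unit per partition, so only task-completion
// bookkeeping reaches the driver.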

2015-01-30 14:42 GMT-08:00 Sean Owen :

> So far, the canonical way to materialize an RDD just to make sure it's
> cached is to call count(). That's fine but incurs the overhead of
> actually counting the elements.
>
> However, rdd.foreachPartition(p => None) for example also seems to
> cause the RDD to be materialized, and is a no-op. Is that a better way
> to do it or am I not thinking of why it's insufficient?


Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
So far, the canonical way to materialize an RDD just to make sure it's
cached is to call count(). That's fine but incurs the overhead of
actually counting the elements.

However, rdd.foreachPartition(p => None) for example also seems to
cause the RDD to be materialized, and is a no-op. Is that a better way
to do it or am I not thinking of why it's insufficient?
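
Concretely, the two options being compared, as a minimal sketch (assumes a
live SparkContext named sc; the data is illustrative):

val rdd = sc.parallelize(1 to 1000000).cache()  // cache() alone is lazy

// Option 1: count() materializes every cached block, but also tallies the
// elements and ships the total back to the driver.
rdd.count()

// Option 2: a no-op action (_ => () is equivalent to the p => None above).
// Each task still obtains its partition's iterator, and for a cached RDD
// that alone appears to force the block to be computed and stored, even
// though the function never consumes the iterator.
rdd.foreachPartition(_ => ())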
