Hello,
As I have written my own data source, I also wrote a custom RDD[Row]
implementation to provide getPartitions and compute overrides.
This works very well, but doing some performance analysis I see that, for
any given pipeline fit operation, a fair amount of time is spent in
RDD.count
Hello all,
I worked around this for now using the class (that I already had) that
inherits from RDD and is the one all of our custom RDDs inherit from. I did
the following:
1) Override all of the transformations (that get used in our app) that don't
change the RDD size, wrapping the results with a
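A rough sketch of that workaround pattern — a wrapper that carries a known size through transformations that cannot change the element count, and invalidates it otherwise. `SizedSeq` is a toy stand-in for illustration, not the real RDD API:

```scala
// Toy stand-in for the workaround: carry a known size through
// size-preserving transformations; drop it for size-changing ones.
class SizedSeq[A](data: => Seq[A], knownSize: Option[Long]) {
  // map() cannot change the number of elements: propagate the size.
  def map[B](f: A => B): SizedSeq[B] =
    new SizedSeq(data.map(f), knownSize)

  // filter() can change the number of elements: invalidate the size.
  def filter(p: A => Boolean): SizedSeq[A] =
    new SizedSeq(data.filter(p), None)

  // count() answers from metadata when available, else materializes.
  def count(): Long = knownSize.getOrElse(data.size.toLong)
}

val base     = new SizedSeq(Seq(1, 2, 3, 4), Some(4L))
val mapped   = base.map(_ * 2)        // size still known: 4
val filtered = mapped.filter(_ > 4)   // size unknown, must compute
```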
I think the worry here is that people often use count() to force execution,
and when coupled with transformations with side effects, it is no longer
safe to skip running it.
However, maybe we can add a new lazy val .size that doesn't require
recomputation.
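The ".size as a lazy val" idea might look roughly like this: the first access runs the full computation (including any side effects), and later accesses reuse the cached value. This is an illustrative toy class, not the actual RDD API:

```scala
// Sketch of a cached lazy size: computed once on first access,
// reused afterwards without recomputation. Toy class for illustration.
class CountingCollection[A](build: () => Seq[A]) {
  private var evaluations = 0

  // First access runs build() (and any side effects it carries);
  // subsequent accesses return the cached value.
  lazy val size: Long = {
    evaluations += 1
    build().size.toLong
  }

  def timesEvaluated: Int = evaluations
}

val cc = new CountingCollection(() => Seq("a", "b", "c"))
val s1 = cc.size // computes: 3
val s2 = cc.size // cached; build() is not run again
```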
On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza wrote:
I definitely see the value in this. However, I think at this point it
would be an incompatible behavioral change. People often use count in
Spark to exercise their DAG. Omitting processing steps that were
previously included would likely mislead many users into thinking their
pipeline was running.
No, I'm not saying side effects change the count. But not executing
the map() function at all certainly has an effect on the side effects
of that function: the side effects which should take place never do. I
am not sure that is something to be 'fixed'; it's a legitimate
question.
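The point about side effects can be seen with any lazy pipeline. Here Scala's `Iterator` stands in for a lazy RDD lineage: the side effect inside map() does not happen until something (like counting) forces evaluation, which is exactly why skipping that evaluation would be observable:

```scala
// With lazy evaluation, side effects inside map() only run when the
// pipeline is forced. Iterator is a stand-in for a lazy RDD lineage.
var sideEffects = 0

val pipeline = Iterator(1, 2, 3).map { x =>
  sideEffects += 1 // e.g. writing to an external store
  x * 2
}

val before = sideEffects // nothing forced yet
val n      = pipeline.size // forces evaluation; side effects now run
val after  = sideEffects
```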
You can persist
Hi Sean,
Thanks for the response.
I can't imagine a case (though my imagination may be somewhat limited) where
even map side effects could change the number of elements in the resulting
RDD.
I guess "count" wouldn't officially be an 'action' if it were implemented
this way. At least it wouldn't
> I was wondering why the RDD.count call recomputes the RDD in all cases? In
> most cases it can simply ask the next dependent RDD. I have several RDD
> implementations and was surprised to see a call like the following never
> call my RDD's count method but instead recompute/traverse
Hi all,
I was wondering why the RDD.count call recomputes the RDD in all cases? In
most cases it can simply ask the next dependent RDD. I have several RDD
implementations and was surprised to see a call like the following never
call my RDD's count method but instead recompute/traverse the e
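The optimization being asked about can be sketched with toy lineage nodes: a node produced by a size-preserving transformation answers count() by asking its parent instead of recomputing itself. These are illustrative classes only, not Spark's actual RDD hierarchy:

```scala
// Toy lineage: count() on a size-preserving node delegates upstream
// instead of recomputing. Illustrative only, not Spark's RDD classes.
abstract class Node[A] {
  var computations = 0 // how many times this node was materialized
  protected def doCompute(): Seq[A]
  def compute(): Seq[A] = { computations += 1; doCompute() }
  def count(): Long = compute().size.toLong
}

class SourceNode[A](data: Seq[A]) extends Node[A] {
  protected def doCompute(): Seq[A] = data
  // A source that knows its size can answer without computing.
  override def count(): Long = data.size.toLong
}

class MapNode[A, B](parent: Node[A], f: A => B) extends Node[B] {
  protected def doCompute(): Seq[B] = parent.compute().map(f)
  // map() is size-preserving, so count() can delegate to the parent.
  override def count(): Long = parent.count()
}

val src    = new SourceNode(Seq(1, 2, 3, 4, 5))
val mapped = new MapNode(src, (x: Int) => x + 1)
val c      = mapped.count() // answered without running compute()
```

Whether this is safe is exactly the debate above: delegating count() this way skips the map() function entirely, so any side effects inside it never run.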