More efficient RDD.count() implementation

2017-07-12 Thread OBones
Hello, As I have written my own data source, I also wrote a custom RDD[Row] implementation to provide getPartitions and compute overrides. This works very well but doing some performance analysis, I see that for any given pipeline fit operation, a fair amount of time is spent in the RDD.count

Re: RDD.count

2015-03-28 Thread jimfcarroll
Hello all, I worked around this for now using the class (that I already had) that inherits from RDD and is the one all of our custom RDDs inherit from. I did the following: 1) Overload all of the transformations (that get used in our app) that don't change the RDD size wrapping the results with a

Re: RDD.count

2015-03-28 Thread Reynold Xin
I think the worry here is that people often use count() to force execution, and when coupled with transformations with side-effect, it is no longer safe to not run it. However, maybe we can add a new lazy val .size that doesn't require recomputation. On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza

Re: RDD.count

2015-03-28 Thread Sandy Ryza
I definitely see the value in this. However, I think at this point it would be an incompatible behavioral change. People often use count in Spark to exercise their DAG. Omitting processing steps that were previously included would likely mislead many users into thinking their pipeline was runnin

Re: RDD.count

2015-03-28 Thread Sean Owen
No, I'm not saying side effects change the count. But not executing the map() function at all certainly has an effect on the side effects of that function: the side effects which should take place never do. I am not sure that is something to be 'fixed'; it's a legitimate question. You can persist

Re: RDD.count

2015-03-28 Thread jimfcarroll
Hi Sean, Thanks for the response. I can't imagine a case (though my imagination may be somewhat limited) where even map side effects could change the number of elements in the resulting map. I guess "count" wouldn't officially be an 'action' if it were implemented this way. At least it wouldn't

Re: RDD.count

2015-03-27 Thread Sean Owen
I was wondering why the RDD.count call recomputes the RDD in all cases? In > most cases it can simply ask the next dependent RDD. I have several RDD > implementations and was surprised to see a call like the following never > call my RDD's count method but instead recompute/traverse

RDD.count

2015-03-27 Thread jimfcarroll
Hi all, I was wondering why the RDD.count call recomputes the RDD in all cases? In most cases it can simply ask the next dependent RDD. I have several RDD implementations and was surprised to see a call like the following never call my RDD's count method but instead recompute/traverse the e