Ted. That one I know. It was the dependency part I was curious about.

On Feb 26, 2015 7:12 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
> bq. whether or not rdd1 is a cached rdd
>
> RDD has a getStorageLevel method which returns the RDD's current storage level.
>
> SparkContext has this method:
>
>   /**
>    * Return information about what RDDs are cached, if they are in mem or on disk, how much space
>    * they take, etc.
>    */
>   @DeveloperApi
>   def getRDDStorageInfo: Array[RDDInfo] = {
>
> Cheers
>
> On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> Zhan,
>>
>> This is exactly what I'm trying to do except, as I mentioned in my first message, I am being given rdd1 and rdd2 only, and I don't necessarily know at that point whether or not rdd1 is a cached rdd. Further, I don't know at that point whether or not rdd2 depends on rdd1.
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>>
>>> In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish, probably due to writing to HDFS. A workaround for this particular case may be as follows:
>>>
>>> val rdd1 = ......cache()
>>>
>>> val rdd2 = rdd1.map().....()
>>> rdd1.count
>>> future { rdd1.saveAsHadoopFile(...) }
>>> future { rdd2.saveAsHadoopFile(...) }
>>>
>>> In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls will happen concurrently.
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> On Feb 26, 2015, at 3:28 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>
>>> > What confused me is the statement "The final result is that rdd1 is calculated twice." Is it the expected behavior?
>>>
>>> To be perfectly honest, performing an action on a cached RDD in two different threads and having them (at the partition level) block until the parent is cached would be the behavior that I and all my coworkers expected.
>>>
>>> On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>
>>>> I should probably mention that my example case is much oversimplified. Let's say I've got a tree, a fairly complex one, where I begin a series of jobs at the root which calculates a bunch of really complex joins, and as I move down the tree I'm creating reports from the data that's already been joined (I've implemented logic to determine when cached items can be cleaned up, e.g. the last report has been done in a subtree).
>>>>
>>>> My issue is that the 'actions' on the RDDs are currently being implemented in a single thread. Even if I'm waiting on a cache to complete fully before I run the "children" jobs, I'm still in a better place than I was, because I'm able to run those jobs concurrently; right now this is not the case.
>>>>
>>>> > What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD.
>>>>
>>>> I totally agree, and if I could get it so that it's waiting at the granularity of the partition, I'd be in a much better place. I feel like I'm going down a rabbit hole and working against the Spark API.
>>>>
>>>> On Thu, Feb 26, 2015 at 6:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> To distill this a bit further, I don't think you actually want rdd2 to wait on rdd1 in this case. What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD. Otherwise the first partition of rdd2 waits on the final partition of rdd1 even when the rest is ready.
>>>>>
>>>>> That is probably a good idea in almost all cases. That much, I don't know how hard it is to implement. But I speculate that it's easier to deal with it at that level than as a function of the dependency graph.
>>>>>
>>>>> On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>>>
>>>>> > I'm trying to do the scheduling myself now: to determine that rdd2 depends on rdd1 and rdd1 is a persistent RDD (storage level != None) so that I can do the no-op on rdd1 before I run rdd2. I would much rather the DAG figure this out so I don't need to think about all this.
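
A minimal sketch of the check Ted describes, assuming an existing SparkContext named sc and the Spark 1.x RDD/RDDInfo API (getStorageLevel, getRDDStorageInfo, and the RDDInfo fields name, numCachedPartitions and memSize):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// "Cached" here means the RDD has been marked for persistence, i.e. its
// storage level is anything other than NONE.
def isPersisted(rdd: RDD[_]): Boolean = rdd.getStorageLevel != StorageLevel.NONE

// Cluster-wide view of what is actually materialized: getRDDStorageInfo only
// reports RDDs that currently hold cached blocks. `sc` is an assumed,
// already-constructed SparkContext.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, ${info.memSize} bytes in memory")
}

Note that getStorageLevel only says the RDD was marked with persist()/cache(); it does not say whether the data has actually been computed yet.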
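
For the other half of the check Corey describes (does rdd2 depend on rdd1?), a sketch that walks rdd.dependencies; dependsOn and materializeParentFirst are hypothetical helper names, and lineage hidden behind a checkpoint will not be visible to the walk:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

import scala.collection.mutable

// Breadth-first walk over rdd.dependencies to see whether `parent` appears
// anywhere in `child`'s lineage.
def dependsOn(child: RDD[_], parent: RDD[_]): Boolean = {
  val seen = mutable.Set[Int]()
  var frontier: Seq[RDD[_]] = child.dependencies.map(_.rdd)
  while (frontier.nonEmpty) {
    if (frontier.exists(_.id == parent.id)) return true
    frontier = frontier.flatMap(_.dependencies.map(_.rdd)).filter(r => seen.add(r.id))
  }
  false
}

// The manual scheduling Corey describes: if rdd2 sits downstream of a persisted
// rdd1, run a cheap action on rdd1 first so its partitions are materialized
// exactly once before rdd2's jobs start.
def materializeParentFirst(rdd1: RDD[_], rdd2: RDD[_]): Unit = {
  if (rdd1.getStorageLevel != StorageLevel.NONE && dependsOn(rdd2, rdd1)) {
    rdd1.count()  // the "no-op" action
  }
}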
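
And a self-contained variant of Zhan's workaround, as a sketch rather than his exact code: it substitutes a toy dataset and saveAsTextFile for the real saveAsHadoopFile calls, and uses scala.concurrent.Future so the two save jobs are submitted from separate driver threads once rdd1 has been materialized by count(). The app name and output paths are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ConcurrentSaves {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-saves"))

    // rdd1 is marked for caching and materialized exactly once by count();
    // both save jobs then read its cached partitions instead of recomputing it.
    val rdd1 = sc.parallelize(1 to 1000000).map(i => (i, i * 2)).cache()
    val rdd2 = rdd1.map { case (k, v) => (k, v + 1) }

    rdd1.count()

    // Submit the two save jobs from separate driver threads so they run concurrently.
    val save1 = Future { rdd1.saveAsTextFile("hdfs:///tmp/out/rdd1") }
    val save2 = Future { rdd2.saveAsTextFile("hdfs:///tmp/out/rdd2") }

    Await.result(Future.sequence(Seq(save1, save2)), Duration.Inf)
    sc.stop()
  }
}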