Exactly how the query is executed depends on a couple of factors, since we do a number of optimizations based on the top physical operator and the final RDD operation that is performed. In general the compute function is only used when you follow SQL with other RDD operations (map, flatMap, etc.). When you call collect we usually call collect directly on the underlying physical RDD (which is not exposed to users, since it plays tricks like object reuse under the covers). However, if your query has a LIMIT then we perform a take instead, and if you have an ORDER BY and a LIMIT then we perform a takeOrdered, and so on.
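To make that concrete, here's a rough sketch of which execution path each kind of query would hit (the queries and table name are just illustrative; this is the behavior described above, not the actual planner code):

```scala
// Assumes an existing SparkContext `sc` and a registered table `people`
// (hypothetical example data, for illustration only).
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val people = sqlContext.sql("SELECT name FROM people")

// Plain collect(): goes straight to collect() on the underlying
// physical RDD (the one that reuses objects under the covers).
people.collect()

// LIMIT: planned as a take() on the physical RDD.
sqlContext.sql("SELECT name FROM people LIMIT 10").collect()

// ORDER BY + LIMIT: planned as a takeOrdered() on the physical RDD.
sqlContext.sql("SELECT name FROM people ORDER BY name LIMIT 10").collect()

// Only when further RDD operations follow the SQL does compute()
// on the SchemaRDD actually get exercised:
people.map(row => row(0)).collect()
```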
On Wed, Nov 26, 2014 at 5:05 AM, Jörg Schad <[email protected]> wrote:
> Hi,
> I have a short question regarding the compute() of a SchemaRDD.
> For SchemaRDD the actual queryExecution seems to be triggered via
> collect(), while compute() triggers only the compute() of the parent and
> copies the data (please correct me if I am wrong!).
>
> Is this compute() triggered at all when I do something like:
> *val schemaRDD2 = schemaRDD.where(...)*
> *schemaRDD2.collect()*
>
> And if not, when is the compute function triggered / what is the intent
> behind it?
>
> Sorry if this is a trivial question, just getting started with Spark
> (SQL)....
> Thanks,
> Joerg
