Thanks Michael, that helped me a lot!

On 23 November 2015 at 17:47, Michael Armbrust <mich...@databricks.com> wrote:
> Here is how I view the relationship between the various components of
> Spark:
>
> - *RDDs* - a low-level API for expressing DAGs that will be executed in
>   parallel by Spark workers.
> - *Catalyst* - an internal library for expressing trees that we use to
>   build relational algebra and expression evaluation. There's also an
>   optimizer and query planner that turns these logical concepts into RDD
>   actions.
> - *Tungsten* - an internal optimized execution engine that can compile
>   Catalyst expressions into efficient Java bytecode that operates directly
>   on serialized binary data. It also has nice low-level data structures /
>   algorithms like hash tables and sorting that operate directly on
>   serialized data. These are used by the physical nodes that are produced
>   by the query planner (and run inside of RDD operations on workers).
> - *DataFrames* - a user-facing API, similar to SQL/LINQ, for constructing
>   dataflows that are backed by Catalyst logical plans.
> - *Datasets* - a user-facing API, similar to the RDD API, for constructing
>   dataflows that are backed by Catalyst logical plans.
>
> So everything is still operating on RDDs, but I anticipate most users will
> eventually migrate to the higher-level APIs for convenience and automatic
> optimization.
>
> On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm doing some reading up on all the newer features of Spark such as
>> DataFrames, Datasets and Project Tungsten. This got me a bit confused
>> about the relation between all these concepts.
>>
>> When starting to learn Spark, I read a book and the original paper on
>> RDDs; this led me to basically think "Spark == RDDs".
>> Now, looking into DataFrames, I read that they are basically
>> (distributed) collections with an associated schema, thus enabling
>> declarative queries and optimization (through Catalyst).
>> I am uncertain how DataFrames relate to RDDs: are DataFrames transformed
>> to operations on RDDs once they have been optimized? Or are they
>> completely different concepts? In the latter case, do DataFrames still
>> use the Spark scheduler and get broken down into a DAG of stages and
>> tasks?
>>
>> Regarding Project Tungsten, where does it fit in? To my understanding it
>> is used to efficiently cache data in memory and may also be used to
>> generate query code for specialized hardware. This sounds as though it
>> would work on Spark's worker nodes; however, it would also only work with
>> schema-associated data (i.e. DataFrames), thus leading me to the
>> conclusion that RDDs and DataFrames do not share a common backend, which
>> in turn contradicts my conception of "Spark == RDDs".
>>
>> Maybe I missed the obvious, as these questions seem pretty basic;
>> however, I was unable to find clear answers in the Spark documentation or
>> related papers and talks. I would greatly appreciate any clarifications.
>>
>> thanks,
>> --Jakob
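[Editor's note: the pipeline Michael describes — a declarative query built as a Catalyst logical tree, optimized, then executed as ordinary RDD operations — can be sketched as a toy program. Everything here (the `Scan`/`Project`/`Filter` classes, `optimize`, `execute`, the single pushdown rule) is invented for illustration and is not Spark's actual API; real Catalyst has many rules, a physical planner, and Tungsten code generation behind it.]

```python
# Toy sketch of the Catalyst idea: build a query as a tree of logical
# operators, rewrite the tree with an optimizer rule (here: predicate
# pushdown below a projection), then translate the optimized tree into
# eager operations on plain Python lists (standing in for RDD actions).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scan:            # leaf node: read the source collection
    rows: list

@dataclass
class Project:         # keep a subset of columns
    child: object
    columns: list

@dataclass
class Filter:          # keep rows matching a predicate
    child: object
    pred: Callable

def optimize(plan):
    """One rewrite rule: push a Filter below a Project (predicate pushdown),
    so rows are discarded before columns are reshaped."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(optimize(Filter(proj.child, plan.pred)), proj.columns)
    if isinstance(plan, (Filter, Project)):
        plan.child = optimize(plan.child)
    return plan

def execute(plan):
    """Translate the (optimized) logical tree into plain list operations —
    the analogue of compiling down to RDD transformations."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.pred(r)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]

rows = [{"name": "a", "age": 30}, {"name": "b", "age": 17}]
# Roughly "SELECT name, age FROM rows WHERE age >= 18", as a logical plan:
plan = Filter(Project(Scan(rows), ["name", "age"]), lambda r: r["age"] >= 18)
result = execute(optimize(plan))
print(result)  # [{'name': 'a', 'age': 30}]
```

The point of the sketch is the layering: the user describes *what* they want (the tree), and only after optimization does anything resembling an RDD operation run — which is why DataFrames can still use the ordinary Spark scheduler underneath.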