Here is how I view the relationship between the various components of Spark:
- *RDDs* - a low-level API for expressing DAGs that will be executed in parallel by Spark workers.
- *Catalyst* - an internal library for expressing trees that we use to build relational algebra and expression evaluation. There's also an optimizer and query planner that turns these logical concepts into RDD actions.
- *Tungsten* - an internal optimized execution engine that can compile Catalyst expressions into efficient Java bytecode that operates directly on serialized binary data. It also has nice low-level data structures/algorithms, like hash tables and sorting, that operate directly on serialized data. These are used by the physical nodes that are produced by the query planner (and run inside of RDD operations on workers).
- *DataFrames* - a user-facing API, similar to SQL/LINQ, for constructing dataflows that are backed by Catalyst logical plans.
- *Datasets* - a user-facing API, similar to the RDD API, for constructing dataflows that are backed by Catalyst logical plans.

So everything is still operating on RDDs, but I anticipate most users will eventually migrate to the higher-level APIs for convenience and automatic optimization.

On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com> wrote:

> Hi everyone,
>
> I'm doing some reading-up on all the newer features of Spark such as
> DataFrames, DataSets and Project Tungsten. This got me a bit confused on
> the relation between all these concepts.
>
> When starting to learn Spark, I read a book and the original paper on
> RDDs; this led me to basically think "Spark == RDDs".
> Now, looking into DataFrames, I read that they are basically (distributed)
> collections with an associated schema, thus enabling declarative queries
> and optimization (through Catalyst). I am uncertain how DataFrames relate
> to RDDs: are DataFrames transformed to operations on RDDs once they have
> been optimized? Or are they completely different concepts?
> In case of the latter, do DataFrames still use the Spark scheduler and
> get broken down into a DAG of stages and tasks?
>
> Regarding Project Tungsten, where does it fit in? To my understanding it
> is used to efficiently cache data in memory and may also be used to
> generate query code for specialized hardware. This sounds as though it
> would work on Spark's worker nodes; however, it would also only work with
> schema-associated data (aka DataFrames), thus leading me to the conclusion
> that RDDs and DataFrames do not share a common backend, which in turn
> contradicts my conception of "Spark == RDDs".
>
> Maybe I missed the obvious as these questions seem pretty basic, however I
> was unable to find clear answers in Spark documentation or related papers
> and talks. I would greatly appreciate any clarifications.
>
> thanks,
> --Jakob
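To make the "trees plus optimizer rules" idea concrete, here is a toy sketch in Scala. This is *not* Spark's actual Catalyst API; the names `Expr`, `Lit`, `Attr`, `Add`, and `constantFold` are invented for illustration. It only shows the general pattern Catalyst uses: represent expressions as immutable trees, then apply rewrite rules that transform them before execution.

```scala
// Toy illustration of the Catalyst-style approach (hypothetical names,
// not Spark's real classes): expressions as an immutable tree.
sealed trait Expr
case class Lit(value: Int) extends Expr            // a literal constant
case class Attr(name: String) extends Expr         // a column reference
case class Add(left: Expr, right: Expr) extends Expr

// One optimizer "rule", applied bottom-up: fold Add(Lit, Lit) into a
// single Lit so the work happens once at planning time, not per row.
def constantFold(e: Expr): Expr = e match {
  case Add(l, r) =>
    (constantFold(l), constantFold(r)) match {
      case (Lit(a), Lit(b)) => Lit(a + b)          // fold constants
      case (fl, fr)         => Add(fl, fr)         // keep the tree otherwise
    }
  case other => other
}

// x + (1 + 2) is rewritten to x + 3 before any data is touched.
val plan = Add(Attr("x"), Add(Lit(1), Lit(2)))
println(constantFold(plan))                        // Add(Attr(x),Lit(3))
```

In real Spark, many such rules run over the logical plan, and the query planner then lowers the optimized plan into physical operators that ultimately execute as RDD operations on the workers.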