Here is how I view the relationship between the various components of Spark:

 - *RDDs* - a low-level API for expressing DAGs that will be executed in
parallel by Spark workers
 - *Catalyst* - an internal library for expressing trees that we use to
build relational algebra and expression evaluation.  There's also an
optimizer and query planner that turns these logical concepts into RDD
actions.
 - *Tungsten* - an internal optimized execution engine that can compile
Catalyst expressions into efficient Java bytecode that operates directly on
serialized binary data.  It also has nice low-level data structures /
algorithms, like hash tables and sorting, that operate directly on
serialized data.  These are used by the physical nodes that are produced by
the query planner (and run inside of RDD operations on workers).
 - *DataFrames* - a user-facing API, similar to SQL/LINQ, for constructing
dataflows that are backed by Catalyst logical plans
 - *Datasets* - a user-facing API, similar to the RDD API, for constructing
dataflows that are backed by Catalyst logical plans (see the sketch after
this list)
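
To make the last two concrete, here is a minimal sketch of the same query
through each user-facing API (this assumes 1.6-style entry points; the
Person case class and the "people.json" path are just for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Long)

    val sc = new SparkContext(
      new SparkConf().setAppName("api-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // RDD: low level; Spark sees opaque objects and opaque functions
    val lines = sc.textFile("people.json")

    // DataFrame: rows + schema; each operation extends a Catalyst
    // logical plan that the optimizer rewrites before execution
    val df = sqlContext.read.json("people.json")
    val adults = df.filter(df("age") >= 18)

    // Dataset: typed like an RDD, but still backed by a Catalyst plan
    val adultNames = df.as[Person].filter(_.age >= 18).map(_.name)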

So everything is still operating on RDDs, but I anticipate most users will
eventually migrate to the higher-level APIs for convenience and automatic
optimization.
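
You can see that layering for yourself. Continuing the sketch above,
explain(true) prints the Catalyst logical, optimized, and physical plans,
and .rdd exposes the RDD that the physical plan ultimately runs as:

    // Logical plan -> optimized plan -> physical plan for the query
    adults.explain(true)

    // The same computation viewed as an RDD[Row]; toDebugString prints
    // its lineage, which the scheduler breaks into stages and tasks
    println(adults.rdd.toDebugString)

That also answers the question below: an optimized DataFrame query is
compiled into ordinary RDD operations and scheduled like any other job.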

On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com> wrote:

> Hi everyone,
>
> I'm doing some reading-up on all the newer features of Spark, such as
> DataFrames, Datasets, and Project Tungsten. This got me a bit confused
> about the relation between all these concepts.
>
> When starting to learn Spark, I read a book and the original paper on
> RDDs; this led me to basically think "Spark == RDDs".
> Now, looking into DataFrames, I read that they are basically (distributed)
> collections with an associated schema, thus enabling declarative queries
> and optimization (through Catalyst). I am uncertain how DataFrames relate
> to RDDs: are DataFrames transformed to operations on RDDs once they have
> been optimized? Or are they completely different concepts? In the case of
> latter, do DataFrames still use the Spark scheduler and get broken down
> into a DAG of stages and tasks?
>
> Regarding Project Tungsten, where does it fit in? To my understanding, it
> is used to efficiently cache data in memory and may also be used to
> generate query code for specialized hardware. This sounds as though it
> would work on Spark's worker nodes; however, it would also only work with
> schema-associated data (aka DataFrames), leading me to the conclusion
> that RDDs and DataFrames do not share a common backend, which in turn
> contradicts my conception of "Spark == RDDs".
>
> Maybe I missed the obvious, as these questions seem pretty basic; however,
> I was unable to find clear answers in the Spark documentation or related papers
> and talks. I would greatly appreciate any clarifications.
>
> thanks,
> --Jakob
>
