Thanks Michael, that helped me a lot!

On 23 November 2015 at 17:47, Michael Armbrust <mich...@databricks.com> wrote:
> Here is how I view the relationship between the various components of
> Spark:
>
> - *RDDs* - a low-level API for expressing DAGs that will be executed in
>   parallel by Spark workers.
> - *Catalyst* - an internal library for expressing trees that we use to
>   build relational algebra and expression evaluation. There's also an
>   optimizer and query planner that turns these logical concepts into RDD
>   actions.
> - *Tungsten* - an internal optimized execution engine that can compile
>   Catalyst expressions into efficient Java bytecode that operates directly
>   on serialized binary data. It also has nice low-level data structures /
>   algorithms like hash tables and sorting that operate directly on
>   serialized data. These are used by the physical nodes that are produced
>   by the query planner (and run inside of RDD operations on workers).
> - *DataFrames* - a user-facing API, similar to SQL/LINQ, for constructing
>   dataflows that are backed by Catalyst logical plans.
> - *Datasets* - a user-facing API, similar to the RDD API, for constructing
>   dataflows that are backed by Catalyst logical plans.
>
> So everything is still operating on RDDs, but I anticipate most users will
> eventually migrate to the higher-level APIs for convenience and automatic
> optimization.
>
> On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm doing some reading up on all the newer features of Spark such as
>> DataFrames, Datasets and Project Tungsten. This got me a bit confused
>> about the relation between all these concepts.
>>
>> When starting to learn Spark, I read a book and the original paper on
>> RDDs; this led me to basically think "Spark == RDDs".
>> Now, looking into DataFrames, I read that they are basically
>> (distributed) collections with an associated schema, thus enabling
>> declarative queries and optimization (through Catalyst).
>> I am uncertain how DataFrames relate to RDDs: are DataFrames transformed
>> to operations on RDDs once they have been optimized? Or are they
>> completely different concepts? In the latter case, do DataFrames still
>> use the Spark scheduler and get broken down into a DAG of stages and
>> tasks?
>>
>> Regarding Project Tungsten, where does it fit in? To my understanding it
>> is used to efficiently cache data in memory and may also be used to
>> generate query code for specialized hardware. This sounds as though it
>> would work on Spark's worker nodes; however, it would also only work with
>> schema-associated data (i.e. DataFrames), thus leading me to the
>> conclusion that RDDs and DataFrames do not share a common backend, which
>> in turn contradicts my conception of "Spark == RDDs".
>>
>> Maybe I missed the obvious, as these questions seem pretty basic;
>> however, I was unable to find clear answers in the Spark documentation or
>> related papers and talks. I would greatly appreciate any clarifications.
>>
>> thanks,
>> --Jakob
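[Editor's note: the pipeline Michael describes — a declarative query built as a Catalyst logical tree, optimized, then executed as ordinary RDD operations — can be sketched as a toy program. Everything here (the `Scan`/`Project`/`Filter` classes, `optimize`, `execute`, the single pushdown rule) is invented for illustration and is not Spark's actual API; real Catalyst has many rules, a physical planner, and Tungsten code generation behind it.]

```python
# Toy sketch of the Catalyst idea: build a query as a tree of logical
# operators, rewrite the tree with an optimizer rule (here: predicate
# pushdown below a projection), then translate the optimized tree into
# eager operations on plain Python lists (standing in for RDD actions).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scan:            # leaf node: read the source collection
    rows: list

@dataclass
class Project:         # keep a subset of columns
    child: object
    columns: list

@dataclass
class Filter:          # keep rows matching a predicate
    child: object
    pred: Callable

def optimize(plan):
    """One rewrite rule: push a Filter below a Project (predicate pushdown),
    so rows are discarded before columns are reshaped."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(optimize(Filter(proj.child, plan.pred)), proj.columns)
    if isinstance(plan, (Filter, Project)):
        plan.child = optimize(plan.child)
    return plan

def execute(plan):
    """Translate the (optimized) logical tree into plain list operations —
    the analogue of compiling down to RDD transformations."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.pred(r)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]

rows = [{"name": "a", "age": 30}, {"name": "b", "age": 17}]
# Roughly "SELECT name, age FROM rows WHERE age >= 18", as a logical plan:
plan = Filter(Project(Scan(rows), ["name", "age"]), lambda r: r["age"] >= 18)
result = execute(optimize(plan))
print(result)  # [{'name': 'a', 'age': 30}]
```

The point of the sketch is the layering: the user describes *what* they want (the tree), and only after optimization does anything resembling an RDD operation run — which is why DataFrames can still use the ordinary Spark scheduler underneath.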