Hi everyone,

I'm doing some reading up on the newer features of Spark, such as
DataFrames, Datasets and Project Tungsten. This got me a bit confused
about how all these concepts relate to each other.

When I started learning Spark, I read a book and the original paper on
RDDs, which led me to basically think "Spark == RDDs".
Now, looking into DataFrames, I read that they are basically (distributed)
collections with an associated schema, thus enabling declarative queries
and optimization (through Catalyst). What I am uncertain about is how
DataFrames relate to RDDs: are DataFrames translated into operations on
RDDs once they have been optimized? Or are they completely different
concepts? If it's the latter, do DataFrames still use the Spark scheduler
and get broken down into a DAG of stages and tasks?
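
To make this concrete, here is the kind of thing I have been poking at
locally (a minimal sketch against a Spark 1.6-style SQLContext; the object
name, app name and sample data are just made up by me):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameVsRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("df-vs-rdd").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // A DataFrame: a distributed collection plus a schema.
        val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .toDF("key", "value")

        // A declarative query that Catalyst can optimize.
        val counts = df.filter($"value" > 1).groupBy($"key").count()

        // Prints the parsed/analyzed/optimized logical plans and the
        // physical plan that will actually be executed.
        counts.explain(true)

        // The DataFrame exposes an underlying RDD[Row], which is what
        // made me wonder whether the optimized plan ends up being
        // executed as plain RDD operations.
        println(counts.rdd.toDebugString)

        sc.stop()
      }
    }

The physical plan from explain() and the lineage from toDebugString both
look scheduler-like to me, but I could not find documentation saying
explicitly that this is what happens.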

Regarding Project Tungsten, where does it fit in? To my understanding, it
is used to efficiently manage and cache data in memory (in a compact
binary format rather than as Java objects) and may also be used to
generate query code that exploits modern hardware. This sounds as though
it would work on Spark's worker nodes; however, it would also only apply
to data with an associated schema (i.e. DataFrames). That leads me to the
conclusion that RDDs and DataFrames do not share a common backend, which
in turn contradicts my conception of "Spark == RDDs".
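
For example, one experiment I considered (continuing with the sc and
sqlContext from the sketch above; the names are again mine) is comparing
the in-memory size of data cached as an RDD of Java objects with the same
data cached as a DataFrame, which I understand uses a compact
binary/columnar layout:

    // Cache the data once as an RDD of Java objects ...
    val pairs = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
    pairs.setName("plain-rdd").cache()
    pairs.count()  // materialize the cache

    // ... and once as a DataFrame, which I understand is cached in a
    // compact binary/columnar representation instead of Java objects.
    val cachedDf = sqlContext.createDataFrame(pairs).toDF("id", "str")
    cachedDf.cache()
    cachedDf.count()  // materialize the cache

    // Compare how much memory each cached representation occupies.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize} bytes in memory")
    }

If the DataFrame cache comes out much smaller because Tungsten only kicks
in for schema-associated data, that would support the two-backends idea,
which is exactly what confuses me.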

Maybe I'm missing the obvious, as these questions seem pretty basic;
however, I was unable to find clear answers in the Spark documentation or
in related papers and talks. I would greatly appreciate any clarifications.

thanks,
--Jakob
