Hi everyone, I'm doing some reading up on the newer features of Spark, such as DataFrames, Datasets, and Project Tungsten. This got me a bit confused about how all these concepts relate to each other.
When I started learning Spark, I read a book and the original paper on RDDs, which led me to basically think "Spark == RDDs". Now, looking into DataFrames, I read that they are essentially (distributed) collections with an associated schema, enabling declarative queries and optimization (through Catalyst).

I am uncertain how DataFrames relate to RDDs: are DataFrame operations translated into operations on RDDs once they have been optimized? Or are they completely different concepts? If the latter, do DataFrames still use the Spark scheduler and get broken down into a DAG of stages and tasks?

Regarding Project Tungsten, where does it fit in? My understanding is that it is used to cache data in memory efficiently and may also generate query code for specialized hardware. That suggests it runs on Spark's worker nodes, yet it would only work with schema-associated data (i.e., DataFrames). This leads me to conclude that RDDs and DataFrames do not share a common backend, which in turn contradicts my conception of "Spark == RDDs".

Maybe I'm missing the obvious, as these questions seem pretty basic, but I was unable to find clear answers in the Spark documentation or in related papers and talks. For concreteness, here is a minimal sketch of how I currently picture the two APIs relating; the specifics are just my assumptions, not something I have verified.
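The sketch assumes the Spark 1.6-era Scala API (SQLContext rather than SparkSession); the object name, local master, and column names are purely illustrative:

    // Assumed: Spark 1.6-era Scala API; all names here are illustrative only.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameVsRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("df-vs-rdd").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Plain RDD: opaque JVM objects, no schema Spark can inspect.
        val rdd = sc.parallelize(Seq(("alice", 34), ("bob", 29)))

        // DataFrame: the same data plus a schema, which is what
        // enables Catalyst to optimize declarative queries.
        val df = rdd.toDF("name", "age")

        // Prints the logical and physical plans Catalyst produces.
        df.filter($"age" > 30).explain(true)

        // A DataFrame exposes an underlying RDD[Row]; this is why I
        // assumed DataFrame queries compile down to RDD operations.
        println(df.rdd.count())

        sc.stop()
      }
    }

If that mental picture (DataFrame = RDD[Row] + schema + Catalyst) is wrong, that is exactly the correction I'm hoping for. I would greatly appreciate any clarifications.

thanks,
--Jakob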