You may refer to the DataFrame Scaladoc:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Methods listed under "Language Integrated Queries" and "RDD Operations"
can be viewed as "transformations", and those listed under "Actions"
are, of course, actions. As for SQLContext.load, it's listed in the
"Generic Data Sources" section:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
I think a simple rule can be: if a DataFrame method or a SQLContext
method returns a DataFrame or an RDD, then it is lazily evaluated, since
DataFrame and RDD are both lazily evaluated.
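
To make the rule concrete for the two methods asked about, here is a
minimal sketch, assuming the Spark 1.3/1.4-era API and a local master.
The object name and the two Parquet paths are hypothetical and not from
this thread. Both load and unionAll return a DataFrame, so by the rule
above neither should run a job; count() returns a Long, so it does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object LazyLoadUnionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-load-union").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input paths (assumption). load(path) uses the default
    // data source, which is Parquet unless configured otherwise.
    // Returns a DataFrame: no rows are scanned here, although resolving
    // the schema may still touch file metadata.
    val a: DataFrame = sqlContext.load("/tmp/events_a.parquet")
    val b: DataFrame = sqlContext.load("/tmp/events_b.parquet")

    // Returns a DataFrame: only extends the logical plan, nothing executes.
    val all: DataFrame = a.unionAll(b)

    // Returns a Long, not a DataFrame or RDD: this is the action that
    // actually runs the job over both inputs.
    val n: Long = all.count()
    println(s"rows: $n")

    sc.stop()
  }
}
```

If the sketch holds, calling load and unionAll from several threads
would mostly parallelize plan construction, not the data scan itself.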
Cheng
On 6/8/15 8:11 PM, kiran lonikar wrote:
Thanks. Can you point me to a place in the SQL programming guide or
the DataFrame Scaladoc where these transformations and actions are
grouped, as they are for RDDs?
Also if you can tell me if sqlContext.load and unionAll are
transformations or actions...
I answered a question on the forum assuming unionAll is a blocking
call and said execution of multiple load and df.unionAll in different
threads would benefit performance :)
Kiran
On 08-Jun-2015 4:37 pm, "Cheng Lian" <lian.cs....@gmail.com> wrote:
DataFrame also has transformations and actions, and its
transformations are lazily evaluated as well. However, DataFrame
transformations like filter(), select(), and agg() return a DataFrame
rather than an RDD. Other methods like show() and collect() are
actions.
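
A small sketch of the distinction, assuming the Spark 1.3/1.4-era API
and a local master. The Person case class, the sample rows, and the
object name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.functions.avg

object TransformationsVsActionsSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("df-lazy").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    val people: DataFrame =
      sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 19), Person("Cy", 52))).toDF()

    // Transformations: each call returns a new DataFrame and only
    // describes the computation; no job is submitted yet.
    val adults: DataFrame = people.filter(people("age") >= 21)
    val names: DataFrame = adults.select("name")
    val avgAge: DataFrame = adults.agg(avg("age"))

    // Actions: these trigger execution of the plans built above.
    names.show()                             // prints the rows to stdout
    val rows: Array[Row] = avgAge.collect()  // brings results to the driver
    println(rows.mkString(", "))

    sc.stop()
  }
}
```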
Cheng
On 6/8/15 1:33 PM, kiran lonikar wrote:
Thanks for replying twice :) I think I sent this question by
email and somehow thought I did not send it, hence created the
other one on the web interface. Let's retain this thread since you
have provided more details here.
Great, it confirms my intuition about DataFrame. It's similar to
Shark's columnar layout, with the addition of compression. Shark
used Java NIO's ByteBuffer to hold the actual data. I will go
through the code you pointed to.
I have another question about DataFrame: RDD operations are
divided into two groups: *transformations*, which are lazily
evaluated and return a new RDD, and *actions*, which evaluate the
lineage defined by the transformations and return results. What
about DataFrame operations like join, groupBy, agg, unionAll, etc.,
which are all transformations on RDDs? Are they lazily evaluated or
immediately executed?