You may refer to DataFrame Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Methods listed in "Language Integrated Queries" and "RDD Operations" can be viewed as "transformations", and those listed in "Actions" are, of course, actions. As for SQLContext.load, it's listed in the "Generic Data Sources" section of the SQLContext Scaladoc: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext

I think a simple rule can be: if a DataFrame method or a SQLContext method returns a DataFrame or an RDD, then it is lazily evaluated, since DataFrame and RDD are both lazily evaluated.
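
As a rough sketch of that rule, assuming Spark 1.3/1.4-era APIs running in local mode: the object name, the column names and the accumulator below are made up for illustration. The accumulator only counts how many source rows have actually been computed, so it should not move until an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical demo object; not part of Spark.
object LazyDataFrameSketch {

  // Returns (rows computed before the action, count() result, rows computed after the action).
  def demo(sc: SparkContext): (Int, Long, Int) = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Incremented once for every source row that is actually computed.
    val rowsComputed = sc.accumulator(0)

    // map() is an RDD transformation, so nothing runs here.
    val source = sc.parallelize(1 to 5).map { i =>
      rowsComputed += 1
      (i, i * 2)
    }

    // toDF, unionAll and filter all return a DataFrame, so by the rule
    // above they only build up a query plan.
    val df = source.toDF("id", "v")
    val combined = df.unionAll(df).filter($"id" > 2)
    val before = rowsComputed.value

    // count() returns a Long rather than a DataFrame: it is an action
    // and triggers the actual job.
    val n = combined.count()
    val after = rowsComputed.value

    (before, n, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-df-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, n, after) = demo(sc)
      println(s"rows computed before action: $before")
      println(s"count() result: $n")
      println(s"rows computed after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

SQLContext.load fits the same rule since it returns a DataFrame, although some data sources may still read metadata such as the schema when the DataFrame is created. That also means calling load and unionAll from several threads does not by itself run anything in parallel; the work only starts when an action is invoked on the result.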

Cheng

On 6/8/15 8:11 PM, kiran lonikar wrote:

Thanks. Can you point me to a place in the SQL programming guide or the DataFrame Scaladoc where transformations and actions are grouped, as they are for RDDs?

Also, could you tell me whether sqlContext.load and unionAll are transformations or actions?

I answered a question on the forum assuming that unionAll is a blocking call, and said that running multiple load and df.unionAll calls in different threads would improve performance :)

Kiran

On 08-Jun-2015 4:37 pm, "Cheng Lian" <lian.cs....@gmail.com> wrote:

    DataFrame has transformations and actions too, and its
    transformations are also lazily evaluated. The difference is that
    DataFrame transformations like filter(), select(), and agg() return
    a DataFrame rather than an RDD. Methods like show() and collect()
    are actions.

    Cheng

    On 6/8/15 1:33 PM, kiran lonikar wrote:
    Thanks for replying twice :) I think I sent this question by
    email, somehow thought I had not sent it, and so created the
    other one on the web interface. Let's keep this thread since you
    have provided more details here.

    Great, it confirms my intuition about DataFrame. It's similar to
    Shark's columnar layout, with the addition of compression. Shark
    used Java NIO's ByteBuffer to hold the actual data. I will go
    through the code you pointed to.

    I have another question about DataFrame. RDD operations are
    divided into two groups: *transformations*, which are lazily
    evaluated and return a new RDD, and *actions*, which evaluate the
    lineage defined by the transformations and return results. What
    about DataFrame operations like join, groupBy, agg, unionAll,
    etc., which are all transformations for RDDs? Are they lazily
    evaluated or immediately executed?



