Re: columnar structure of RDDs from Parquet or ORC files

Cheng Lian Mon, 08 Jun 2015 07:56:58 -0700

You may refer to the DataFrame Scaladoc: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Methods listed in "Language Integrated Queries" and "RDD Options" can be viewed as "transformations", and those listed in "Actions" are, of course, actions. As for SQLContext.load, it's listed in the "Generic Data Sources" section: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext

I think a simple rule can be: if a DataFrame method or a SQLContext method returns a DataFrame or an RDD, then it is lazily evaluated, since DataFrame and RDD are both lazily evaluated.
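
As a minimal sketch of this rule (assuming Spark 1.3-era APIs and a local master; the object name, column names, and data are made up for illustration), the calls that return a DataFrame below only build a plan, and nothing is computed until the action at the end:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical example object; not taken from the Spark code base.
object LazyDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-df-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Two small in-memory DataFrames stand in for data loaded from files.
    // The "id" and "name" columns are assumptions made for this sketch.
    val a: DataFrame = sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("id", "name")
    val b: DataFrame = sc.parallelize(Seq((3, "z"), (4, "w"))).toDF("id", "name")

    // unionAll and filter both return a DataFrame, so by the rule above
    // they only extend the query plan and do not process any rows yet.
    val combined: DataFrame = a.unionAll(b)
    val filtered: DataFrame = combined.filter(combined("id") > 2)

    // count() returns a Long rather than a DataFrame, so it is an action:
    // this is the point at which the plan is actually executed.
    val n: Long = filtered.count()
    println(s"rows after filter: $n")

    sc.stop()
  }
}
```

By the same rule, sqlContext.load returns a DataFrame, so it should not scan the data by itself, although a data source may still read metadata such as the schema when the DataFrame is created.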


Cheng

On 6/8/15 8:11 PM, kiran lonikar wrote:

Thanks. Can you point me to a place in the SQL programming guide or the DataFrame scaladoc where transformations and actions are grouped, as they are for RDDs?

Also, if you can tell me whether sqlContext.load and unionAll are transformations or actions...

I answered a question on the forum assuming unionAll is a blocking call, and said that executing multiple load and df.unionAll calls in different threads would benefit performance :)


Kiran

On 08-Jun-2015 4:37 pm, "Cheng Lian" <lian.cs....@gmail.com> wrote:


    DataFrame also has transformations and actions, and its
    transformations are also lazily evaluated. However, DataFrame
    transformations like filter(), select(), and agg() return a
    DataFrame rather than an RDD. Other methods like show() and
    collect() are actions.

    Cheng

    On 6/8/15 1:33 PM, kiran lonikar wrote:

    Thanks for replying twice :) I think I sent this question by
    email and somehow thought I did not send it, hence created the
    other one on the web interface. Let's retain this thread since you
    have provided more details here.

    Great, it confirms my intuition about DataFrame. It's similar to
    the Shark columnar layout, with the addition of compression. Shark
    used Java NIO's ByteBuffer to hold the actual data. I will go
    through the code you pointed to.

    I have another question about DataFrame: The RDD operations are
    divided into two groups: *transformations*, which are lazily
    evaluated and return a new RDD, and *actions*, which evaluate the
    lineage defined by the transformations and return results. What
    about DataFrame operations like join, groupBy, agg, unionAll,
    etc., whose RDD counterparts are all transformations? Are they
    lazily evaluated or immediately executed?
