Re: advantages of SparkSQL?

Cheng Lian Mon, 24 Nov 2014 20:51:32 -0800

For the “never register a table” part, actually you /can/ use Spark SQL without registering a table via its DSL. Say you’re going to extract an |Int| field named |key| from the table and double it:


|import org.apache.spark.sql.catalyst.dsl._
val data = sqc.parquetFile(path)
val double = (i: Int) => i * 2
data.select(double.call('key) as 'result).collect()
|

Notice that the |.call| method is only available in the most recent master and branch-1.2.


On 11/25/14 5:19 AM, Michael Armbrust wrote:

Akshat is correct about the benefits of Parquet as a columnar format, but I'll add that some of this is lost if you just use a lambda function to process the data. Since your lambda function is a black box, Spark SQL does not know which columns it is going to use and thus will do a full table scan. I'd suggest writing a very simple SQL query that pulls out just the columns you need and does any filtering before dropping back into standard Spark operations. The result of SQL queries is an RDD of rows, so you can do any normal Spark processing you want on them.
Either way, though, it will often be faster than a text file due to better encoding/compression.
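
The suggestion above could be sketched in PySpark roughly as follows. This is a minimal sketch, assuming the Spark 1.1/1.2-era Python API (parquetFile, registerTempTable, sql) and reusing sqc, path, applyfunc and field from the original question. The helper name project_then_map, the temp table name "data" and the IS NOT NULL filter are hypothetical and only there for illustration.

```python
def project_then_map(sqc, path, applyfunc):
    """Select only the needed column with SQL, then continue with plain RDD operations.

    Assumes `sqc` is a Spark 1.x SQLContext. `field` is the column name used in
    the original question; the temp table name "data" is a hypothetical choice.
    """
    data = sqc.parquetFile(path)
    # Registering a temp table gives the Parquet data a name that SQL can refer to.
    data.registerTempTable("data")
    # Naming the column (and the filter) in SQL lets Spark SQL see which columns
    # are used, instead of treating the whole computation as an opaque lambda.
    projected = sqc.sql("SELECT field FROM data WHERE field IS NOT NULL")
    # The query result is an RDD of rows, so ordinary RDD operations apply from here.
    return projected.map(lambda row: applyfunc(row.field))
```

Calling project_then_map(sqc, path, applyfunc) should give the same kind of result as the map in the original question, but built from a query that names the one column it needs.
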
On Mon, Nov 24, 2014 at 8:54 AM, Akshat Aranya <[email protected]<mailto:[email protected]>> wrote:
    Parquet is a column-oriented format, which means that you need to
    read in less data from the file system if you're only interested
    in a subset of your columns.  Also, Parquet pushes down selection
    predicates, which can eliminate needless deserialization of rows
    that don't match a selection criterion.  Other than that, you
    would also get compression, and likely save processor cycles when
    parsing lines from text files.



    On Mon, Nov 24, 2014 at 8:20 AM, mrm <[email protected]
    <mailto:[email protected]>> wrote:

        Hi,

        Is there any advantage to storing data in Parquet format, loading
        it using the Spark SQL context, but never registering it as a
        table or using SQL on it?
        Something like:
        data = sqc.parquetFile(path)
        results = data.map(lambda x: applyfunc(x.field))

        Is this faster/more optimised than having the data stored as a
        text file and using Spark (non-SQL) to process it?



