Minor typo in the example. The first SELECT statement should actually be: sql("SELECT * FROM src")
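For reference, with the typo fixed the example would run in this order (REPL output omitted; the remaining two statements are unchanged from the message below):

scala> sql("SELECT * FROM src")

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT key FROM selectQuery")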
Where `src` is a Hive table with schema (key INT, value STRING).

On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust <mich...@databricks.com> wrote:

>> In such a construct, each operator builds on the previous one, including
>> any materialized results etc. If I use a SQL statement for each of them, I
>> suspect the later SQL statements will not leverage the earlier ones by any
>> means - hence these will be inefficient compared to the first approach.
>> Let me know if this is not correct.
>
> This is not correct. When you run a SQL statement and register it as a
> table, it is the logical plan for that query that is used when the virtual
> table is referenced in later queries, not the results. SQL queries are
> lazy, just like RDDs and DSL queries. This is illustrated below.
>
> scala> sql("SELECT * FROM selectQuery")
> res3: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[12] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None
>
> scala> sql("SELECT * FROM src").registerAsTable("selectQuery")
>
> scala> sql("SELECT key FROM selectQuery")
> res5: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[24] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#8], (MetastoreRelation default, src, None), None
>
> Even though the second query is running over the "results" of the first
> query (which requested all columns using *), the optimizer is still able to
> come up with an efficient plan that avoids reading "value" from the table,
> as can be seen by the arguments of the HiveTableScan.
>
> Note that if you call sqlContext.cacheTable("selectQuery") then you are
> correct. The results will be materialized in an in-memory columnar format,
> and subsequent queries will be run over these materialized results.
>
>> The reason for building expressions is that the use case needs them to
>> be created on the fly, based on some case class, at runtime.
>>
>> I.e., I can't type these in the REPL. The Scala code will define some case
>> class A (a: ... , b: ..., c: ... ) where the class name, member names and
>> types will be known beforehand, and the RDD will be defined on this. Then,
>> based on user action, the above pipeline needs to be constructed on the
>> fly. Thus the expressions have to be constructed on the fly from class
>> members and other predicates etc., most probably using the expression
>> constructors.
>>
>> Could you please share how expressions could be constructed using the
>> APIs on Expression (and not in the REPL)?
>
> I'm not sure I completely understand the use case here, but you should be
> able to construct symbols and use the DSL to create expressions at runtime,
> just like in the REPL.
>
> val attrName: String = "name"
> val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
>
> There is currently no public API for constructing expressions manually
> other than SQL or the DSL. While you could dig into
> org.apache.spark.sql.catalyst.expressions._, these APIs are considered
> internal, and *will not be stable in between versions*.
>
> Michael
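For the runtime case, here is a rough sketch of how the DSL approach could look against the current alpha API. The case class A, the column names, and the threshold are made up for illustration, and the where/select operators on SchemaRDD may well change between versions:

case class A(a: Int, b: Int, c: String)

// Assumes an existing SparkContext named sc.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings in createSchemaRDD and the Symbol-to-attribute implicits

val data = sc.parallelize(Seq(A(1, 2, "x"), A(3, 4, "y")))

// Column names arrive as plain strings at runtime, e.g. from user input,
// so use Symbol(...) instead of the 'name literal syntax.
val filterCol: String = "a"
val projectCol: String = "b"

// Chain the lazy operators; nothing is evaluated until an action runs.
val pipeline = data
  .where(Symbol(filterCol) > 1)
  .select(Symbol(projectCol))

pipeline.collect().foreach(println)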