Minor typo in the example. The first SELECT statement should actually be: sql("SELECT * FROM src")
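For reference, with the typo fixed the example would run in this order (REPL output omitted; the remaining two statements are unchanged from the message below):

scala> sql("SELECT * FROM src")

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT key FROM selectQuery")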
Where `src` is a Hive table with schema (key INT, value STRING).

On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust <mich...@databricks.com> wrote:

>> In such a construct, each operator builds on the previous one, including
>> any materialized results etc. If I use a SQL statement for each of them, I
>> suspect the later SQL statements will not leverage the earlier ones by any
>> means - hence these will be inefficient compared to the first approach.
>> Let me know if this is not correct.
>
> This is not correct. When you run a SQL statement and register it as a
> table, it is the logical plan for that query that is used when the virtual
> table is referenced in later queries, not the results. SQL queries are
> lazy, just like RDDs and DSL queries. This is illustrated below.
>
> scala> sql("SELECT * FROM selectQuery")
> res3: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[12] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None
>
> scala> sql("SELECT * FROM src").registerAsTable("selectQuery")
>
> scala> sql("SELECT key FROM selectQuery")
> res5: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[24] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#8], (MetastoreRelation default, src, None), None
>
> Even though the second query is running over the "results" of the first
> query (which requested all columns using *), the optimizer is still able to
> come up with an efficient plan that avoids reading "value" from the table,
> as can be seen by the arguments of the HiveTableScan.
>
> Note that if you call sqlContext.cacheTable("selectQuery") then you are
> correct. The results will be materialized in an in-memory columnar format,
> and subsequent queries will be run over these materialized results.
>
>> The reason for building expressions is that the use case needs them to
>> be created on the fly, based on some case class, at runtime.
>>
>> I.e., I can't type these in the REPL. The Scala code will define some case
>> class A (a: ... , b: ..., c: ... ) where the class name, member names and
>> types will be known beforehand, and the RDD will be defined on this. Then,
>> based on user action, the above pipeline needs to be constructed on the
>> fly. Thus the expressions have to be constructed on the fly from class
>> members and other predicates etc., most probably using the expression
>> constructors.
>>
>> Could you please share how expressions could be constructed using the
>> APIs on Expression (and not in the REPL)?
>
> I'm not sure I completely understand the use case here, but you should be
> able to construct symbols and use the DSL to create expressions at runtime,
> just like in the REPL.
>
> val attrName: String = "name"
> val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
>
> There is currently no public API for constructing expressions manually
> other than SQL or the DSL. While you could dig into
> org.apache.spark.sql.catalyst.expressions._, these APIs are considered
> internal, and *will not be stable in between versions*.
>
> Michael
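For the runtime case, here is a rough sketch of how the DSL approach could look against the current alpha API. The case class A, the column names, and the threshold are made up for illustration, and the where/select operators on SchemaRDD may well change between versions:

case class A(a: Int, b: Int, c: String)

// Assumes an existing SparkContext named sc.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings in createSchemaRDD and the Symbol-to-attribute implicits

val data = sc.parallelize(Seq(A(1, 2, "x"), A(3, 4, "y")))

// Column names arrive as plain strings at runtime, e.g. from user input,
// so use Symbol(...) instead of the 'name literal syntax.
val filterCol: String = "a"
val projectCol: String = "b"

// Chain the lazy operators; nothing is evaluated until an action runs.
val pipeline = data
  .where(Symbol(filterCol) > 1)
  .select(Symbol(projectCol))

pipeline.collect().foreach(println)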