> In such construct, each operator builds on the previous one, including any
> materialized results etc. If I use a SQL for each of them, I suspect the
> later SQLs will not leverage the earlier SQLs by any means - hence these
> will be inefficient to first approach. Let me know if this is not correct.
This is not correct. When you run a SQL statement and register it as a
table, it is the logical plan for this query that is used when this virtual
table is referenced in later queries, not the results. SQL queries are lazy,
just like RDDs and DSL queries. This is illustrated below.

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT * FROM selectQuery")
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[12] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None

scala> sql("SELECT key FROM selectQuery")
res5: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#8], (MetastoreRelation default, src, None), None

Even though the second query is running over the "results" of the first
query (which requested all columns using *), the optimizer is still able to
come up with an efficient plan that avoids reading "value" from the table,
as can be seen from the arguments of the HiveTableScan.

Note that if you call sqlContext.cacheTable("selectQuery") then you are
correct. The results will be materialized in an in-memory columnar format,
and subsequent queries will be run over these materialized results.

> The reason for building expressions is that the use case needs these to be
> created on the fly based on some case class at runtime.
>
> I.e., I can't type these in REPL. The scala code will define some case
> class A (a: ... , b: ..., c: ... ) where class name, member names and types
> will be known before hand and the RDD will be defined on this. Then based
> on user action, above pipeline needs to be constructed on fly. Thus the
> expressions has to be constructed on fly from class members and other
> predicates etc., most probably using expression constructors.
>
> Could you please share how expressions could be constructed using the APIs
> on expression (and not on REPL)?

I'm not sure I completely understand the use case here, but you should be
able to construct symbols and use the DSL to create expressions at runtime,
just like in the REPL.

val attrName: String = "name"
val addExpression: Expression = Symbol(attrName) + Symbol(attrName)

There is currently no public API for constructing expressions manually
other than SQL or the DSL. While you could dig into
org.apache.spark.sql.catalyst.expressions._, these APIs are considered
internal and *will not be stable between versions*.

Michael
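
For what it's worth, here is a rough, untested sketch of what such a
runtime-constructed pipeline could look like outside the REPL, using only
the public DSL. The case class Person, the field names, and the literal in
the predicate are invented for illustration; the only assumption is a
1.0-style SQLContext with its implicits imported (a HiveContext works the
same way).

import org.apache.spark.sql.SQLContext

// Hypothetical case class; in the real use case only the member names
// arrive at runtime, as strings.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)  // sc is the usual SparkContext
import sqlContext._                  // RDD -> SchemaRDD and Symbol -> attribute implicits

val people = sc.parallelize(Seq(Person("ann", 30), Person("bob", 20)))

// Attribute names chosen on the fly, e.g. from user input.
val filterField  = "age"
val projectField = "name"

// Expressions built at runtime from strings via Symbols and the DSL.
val predicate  = Symbol(filterField) > 21
val projection = Symbol(projectField)

// Each step only extends the logical plan; nothing runs until collect().
people.where(predicate).select(projection).collect().foreach(println)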