Hi, Shyam, The method registerTempTable is to register a [DataFrame as a temporary table in the Catalog using the given table name.
In the Catalog, Spark maintains a concurrent hashmap, which contains the pair of the table names and the logical plan. For example, when we submit the following query, SELECT * FROM inMemoryDF The concurrent hashmap contains one map from name to the Logical Plan: "inMemoryDF" -> "LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at createDataFrame at SimpleApp.scala:42 Therefore, using SQL will not hurt your performance. The actual physical plan to execute your SQL query is generated by the result of Catalyst optimizer. Good luck, Xiao Li 2015-10-16 20:53 GMT-07:00 Shyam Parimal Katti <[email protected]>: > Thanks Xiao! Question about the internals, would you know what happens > when createTempTable() is called? I. E. Does it create an RDD internally > or some internal representation that lets it join with an RDD? > > Again, thanks for the answer. > On Oct 16, 2015 8:15 PM, "Xiao Li" <[email protected]> wrote: > >> Hi, Shyam, >> >> You still can use SQL to do the same thing in Spark: >> >> For example, >> >> val df1 = sqlContext.createDataFrame(rdd) >> val df2 = sqlContext.createDataFrame(rdd2) >> val df3 = sqlContext.createDataFrame(rdd3) >> df1.registerTempTable("tab1") >> df2.registerTempTable("tab2") >> df3.registerTempTable("tab3") >> val exampleSQL = sqlContext.sql("select * from tab1, tab2, tab3 where >> tab1.name = tab2.name and tab2.id = tab3.id") >> >> Good luck, >> >> Xiao Li >> >> 2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <[email protected]>: >> >>> Hello All, >>> >>> I have a following SQL query like this: >>> >>> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id = >>> b.a_id join table_c c on b.b_id = c.b_id >>> >>> In scala i have done this so far: >>> >>> table_a_rdd = sc.textFile(...) >>> table_b_rdd = sc.textFile(...) >>> table_c_rdd = sc.textFile(...) >>> >>> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line => >>> (line(0), line)) >>> val table_b_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line => >>> (line(0), line)) >>> val table_c_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line => >>> (line(0), line)) >>> >>> Each line has the first value at its primary key. >>> >>> While I can join 2 RDDs using table_a_rowRDD.join(table_b_rowRDD) to >>> join, is it possible to join multiple RDDs in a single expression? like >>> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD) ? Also, how can I >>> specify the column on which I can join multiple RDDs? >>> >>> >>> >>> >>> >>
