Are you using HiveContext?

First, build your Spark using the following command:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver \
  -DskipTests clean package

Then, try this sample program:

import org.apache.spark.SparkContext

object SimpleApp {
  case class Individual(name: String, surname: String, birthDate: String)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "join DFs")
    //val sqlContext = new SQLContext(sc)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val rdd = sc.parallelize(Seq(
      Individual("a", "c", "10/10/1972"),
      Individual("b", "d", "10/11/1970")
    ))

    val df = hiveContext.createDataFrame(rdd)

    df.registerTempTable("tab")

    val dfHive = hiveContext.sql("select * from tab")

    dfHive.show()
  }
}
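
If you package the program above into a jar, a typical way to run it is the following (the sbt layout, jar path, and Scala version are placeholders, not from the original message):

```shell
# build the application jar (assumes an sbt project layout)
sbt package

# run locally; adjust the jar path to your build output
spark-submit \
  --class SimpleApp \
  --master local \
  target/scala-2.10/simple-app_2.10-1.0.jar
```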


2015-10-20 6:24 GMT-07:00 Shyam Parimal Katti <[email protected]>:

> When I do the steps above and run a query like this:
>
> sqlContext.sql("select * from ...")
>
> I get exception:
>
> org.apache.spark.sql.AnalysisException: Non-local session path expected to
> be non-null;
>    at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260)
>    .....
>
> I cannot paste the entire stack since it's on a company laptop and I am
> not allowed to copy paste things! Though if absolutely needed to help, I
> can figure out some way to provide it.
>
> On Sat, Oct 17, 2015 at 1:13 AM, Xiao Li <[email protected]> wrote:
>
>> Hi, Shyam,
>>
>> The method registerTempTable registers a DataFrame as a temporary
>> table in the Catalog under the given table name.
>>
>> In the Catalog, Spark maintains a concurrent hashmap, which contains the
>> pair of the table names and the logical plan.
>>
>> For example, when we submit the following query,
>>
>> SELECT * FROM inMemoryDF
>>
>> the concurrent hashmap contains one entry mapping the name to the
>> logical plan:
>>
>> "inMemoryDF" -> "LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at
>> createDataFrame at SimpleApp.scala:42"
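>>
>> As a simplified sketch of that idea (plain Scala, with a
>> ConcurrentHashMap standing in for the Catalog and a String standing in
>> for the LogicalPlan; this is not Spark's actual Catalog code):
>>
>>     import java.util.concurrent.ConcurrentHashMap
>>
>>     // table name -> logical plan (String as a stand-in for LogicalPlan)
>>     val tempTables = new ConcurrentHashMap[String, String]()
>>     tempTables.put("inMemoryDF", "LogicalRDD [c1#0,c2#1,c3#2,c4#3]")
>>
>>     // a query like SELECT * FROM inMemoryDF resolves the name here:
>>     val plan = Option(tempTables.get("inMemoryDF"))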
>>
>> Therefore, using SQL will not hurt your performance. The actual physical
>> plan that executes your SQL query is produced by the Catalyst optimizer.
>>
>> Good luck,
>>
>> Xiao Li
>>
>>
>>
>> 2015-10-16 20:53 GMT-07:00 Shyam Parimal Katti <[email protected]>:
>>
>>> Thanks Xiao! A question about the internals: would you know what happens
>>> when registerTempTable() is called? I.e., does it create an RDD internally,
>>> or some internal representation that lets it join with an RDD?
>>>
>>> Again, thanks for the answer.
>>> On Oct 16, 2015 8:15 PM, "Xiao Li" <[email protected]> wrote:
>>>
>>>> Hi, Shyam,
>>>>
>>>> You still can use SQL to do the same thing in Spark:
>>>>
>>>> For example,
>>>>
>>>>     val df1 = sqlContext.createDataFrame(rdd)
>>>>     val df2 = sqlContext.createDataFrame(rdd2)
>>>>     val df3 = sqlContext.createDataFrame(rdd3)
>>>>     df1.registerTempTable("tab1")
>>>>     df2.registerTempTable("tab2")
>>>>     df3.registerTempTable("tab3")
>>>>     val exampleSQL = sqlContext.sql("select * from tab1, tab2, tab3
>>>> where tab1.name = tab2.name and tab2.id = tab3.id")
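>>>>
>>>> The same three-way join can also be written with the DataFrame API
>>>> instead of SQL (using the same column names as in the query above):
>>>>
>>>>     val joined = df1.join(df2, df1("name") === df2("name"))
>>>>                     .join(df3, df2("id") === df3("id"))
>>>>     joined.show()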
>>>>
>>>> Good luck,
>>>>
>>>> Xiao Li
>>>>
>>>> 2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <[email protected]>:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I have a following SQL query like this:
>>>>>
>>>>> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id
>>>>> = b.a_id join table_c c on b.b_id = c.b_id
>>>>>
>>>>> In Scala I have done this so far:
>>>>>
>>>>> val table_a_rdd = sc.textFile(...)
>>>>> val table_b_rdd = sc.textFile(...)
>>>>> val table_c_rdd = sc.textFile(...)
>>>>>
>>>>> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line =>
>>>>> (line(0), line))
>>>>> val table_b_rowRDD = table_b_rdd.map(_.split("\\x07")).map(line =>
>>>>> (line(0), line))
>>>>> val table_c_rowRDD = table_c_rdd.map(_.split("\\x07")).map(line =>
>>>>> (line(0), line))
>>>>>
>>>>> Each line has its first value as its primary key.
>>>>>
>>>>> While I can join two RDDs using table_a_rowRDD.join(table_b_rowRDD),
>>>>> is it possible to join multiple RDDs in a single expression, like
>>>>> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD)? Also, how can I
>>>>> specify the column on which to join multiple RDDs?
>>>>>
