Hi, Shyam,
You can still use SQL to do the same thing in Spark:
For example,
val df1 = sqlContext.createDataFrame(rdd)
val df2 = sqlContext.createDataFrame(rdd2)
val df3 = sqlContext.createDataFrame(rdd3)
df1.registerTempTable("tab1")
df2.registerTempTable("tab2")
df3.registerTempTable("tab3")
val exampleSQL = sqlContext.sql(
  "select * from tab1, tab2, tab3 where tab1.name = tab2.name and tab2.id = tab3.id")
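If you want to stay with the RDD API instead, note that chained joins need a re-key step in between, because the result of the first join is still keyed on the first table's key. Here is a minimal pure-Scala sketch (no Spark; the `join` helper and the sample rows are made up for illustration) of what pair-wise `join` does and where the re-keying fits:

```scala
// Stand-in for PairRDD.join on plain Seqs, for illustration only:
// inner join of two keyed sequences on their keys.
def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for ((k, a) <- left; (k2, b) <- right if k == k2) yield (k, (a, b))

// Hypothetical sample rows: tableA keyed by a_id, tableB keyed by a_id
// (carrying its own b_id), tableC keyed by b_id.
val tableA = Seq(("a1", "rowA1"), ("a2", "rowA2"))
val tableB = Seq(("a1", ("b1", "rowB1")), ("a2", ("b2", "rowB2")))
val tableC = Seq(("b1", "rowC1"), ("b2", "rowC2"))

// First join on a_id, then re-key the result on b_id before joining tableC.
val ab = join(tableA, tableB)
val rekeyed = ab.map { case (aId, (rowA, (bId, rowB))) => (bId, (aId, rowA, rowB)) }
val abc = join(rekeyed, tableC)
```

The same re-keying pattern applies to real PairRDDs: map the output of the first join so the new key is the column you want the next join to use.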
Good luck,
Xiao Li
2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <[email protected]>:
> Hello All,
>
> I have the following SQL query:
>
> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id =
> b.a_id join table_c c on b.b_id = c.b_id
>
> In Scala I have done this so far:
>
> val table_a_rdd = sc.textFile(...)
> val table_b_rdd = sc.textFile(...)
> val table_c_rdd = sc.textFile(...)
>
> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line =>
>   (line(0), line))
> val table_b_rowRDD = table_b_rdd.map(_.split("\\x07")).map(line =>
>   (line(0), line))
> val table_c_rowRDD = table_c_rdd.map(_.split("\\x07")).map(line =>
>   (line(0), line))
>
> Each line has the first value as its primary key.
>
> While I can join 2 RDDs using table_a_rowRDD.join(table_b_rowRDD), is it
> possible to join multiple RDDs in a single expression, like
> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD)? Also, how can I
> specify the column on which to join multiple RDDs?
>