Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-12 Thread Tathagata Das
You have understood the problem right. However, note that your interpretation of the output *(K, leftValue, null)*, *(K, leftValue, rightValue1)*, *(K, leftValue, rightValue2)* is subject to knowledge of the semantics of the join. That is, if you are processing the output rows *manually*, you are
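The interpretation being debated can be made concrete with a small sketch. This is plain Python, not Spark API code, and the names are illustrative: for one left row, an outer join yields a null-padded row only when there is no match, and one row per match otherwise; matches accumulate rather than replace one another.

```python
# Illustrative sketch (plain Python, not Spark): the output rows an outer
# join produces for a single left row, given the right-side matches seen.

def outer_join_rows(key, left_value, right_values):
    """Return the outer-join output rows for one left row."""
    if not right_values:
        # No match: the outer join pads the right side with null (None).
        return [(key, left_value, None)]
    # One output row per matching right row; matches accumulate,
    # they do not replace each other.
    return [(key, left_value, r) for r in right_values]

no_match = outer_join_rows("K", "leftValue", [])
two_matches = outer_join_rows("K", "leftValue", ["rightValue1", "rightValue2"])
# no_match    -> [("K", "leftValue", None)]
# two_matches -> [("K", "leftValue", "rightValue1"),
#                 ("K", "leftValue", "rightValue2")]
```

The subtlety raised in the thread is what a *consumer* of these rows should do if it first sees the null-padded row and matched rows arrive later; that requires knowing the join semantics.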

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-10 Thread kant kodali
I will attempt to answer this. Since rightValue1 and rightValue2 have the same key "K" (two matches), why would it ever be the case that *rightValue2* replaces *rightValue1*, which replaced *null*? Moreover, why does the user need to care? The result in this case (after getting 2 matches) should be

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Tathagata Das
This doc is unrelated to the stream-stream join we added in Structured Streaming. :) That said, we added append mode first because it is easier to reason about the semantics of append mode, especially in the context of outer joins. You output a row only when you know it won't ever be changed. The
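The "output a row only when you know it won't ever be changed" rule can be sketched as follows. This is a plain-Python illustration of the idea, not Spark internals, and the names are hypothetical: matched rows are final as soon as the match is seen, while an unmatched left row's null-padded output only becomes final once the watermark has passed its event time (no late match can arrive).

```python
# Sketch of the append-mode rule for stream-stream outer joins described
# above (plain Python, illustrative names; not Spark's implementation).

def append_mode_outer(left_rows, right_rows, watermark):
    """left_rows: list of (key, value, event_time).
    right_rows: dict of key -> list of right-side values seen so far.
    Emit matched rows immediately; emit the null-padded row for an
    unmatched left row only once the watermark passes its event time."""
    emitted = []
    for key, value, event_time in left_rows:
        matches = right_rows.get(key, [])
        if matches:
            # A match is final the moment it is seen.
            emitted.extend((key, value, r) for r in matches)
        elif event_time < watermark:
            # Too late for any match to arrive: the null row is now final.
            emitted.append((key, value, None))
        # Otherwise: hold the row back; a match may still arrive.
    return emitted

out = append_mode_outer(
    [("K", "leftValue", 10), ("J", "x", 1)],
    {"K": ["rightValue1"]},
    watermark=5,
)
# out -> [("K", "leftValue", "rightValue1"), ("J", "x", None)]
```

This is why append mode avoids the "rightValue2 replaces rightValue1" question entirely: nothing emitted is ever retracted or updated.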

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Gourav Sengupta
super interesting. On Wed, Mar 7, 2018 at 11:44 AM, kant kodali wrote: > It looks to me that the StateStore described in this doc > actually > has full outer join and every other join

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-07 Thread kant kodali
It looks to me that the StateStore described in this doc actually has a full outer join, and every other join is a filter of that. Also, the doc talks about update mode, but it looks like Spark 2.3 ended up with append

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Sorry, I meant Spark 2.4 in my previous email. On Tue, Mar 6, 2018 at 9:15 PM, kant kodali wrote: > Hi TD, > > I agree. I think we are better off either with a full fix or no fix. I am > ok with the complete fix being available in master or some branch. I guess > the solution

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Hi TD, I agree. I think we are better off either with a full fix or no fix. I am ok with the complete fix being available in master or some branch. I guess the solution for me is to just build from source. On a similar note, I am not finding any JIRA tickets related to full outer joins and

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread Tathagata Das
I thought about it. I am not 100% sure whether this fix should go into 2.3.1. There are two parts to this bug fix to enable self-joins. 1. Enabling deduping of leaf logical nodes by extending MultiInstanceRelation - this is safe to backport into the 2.3 branch, as it does not touch

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread kant kodali
Hi TD, I pulled your commit that is listed on this ticket https://issues.apache.org/jira/browse/SPARK-23406 . Specifically, I did the following steps, and self-joins work after I cherry-pick your commit! Good job! I was hoping it would be part of 2.3.0, but it looks like it is targeted for 2.3.1 :( git

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread Tathagata Das
Hey, thanks for testing out stream-stream joins and reporting this issue. I am going to take a look at this. TD On Tue, Feb 20, 2018 at 8:20 PM, kant kodali wrote: > If I change it to the below code it works. However, I don't believe it is > the solution I am looking

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to the code below, it works. However, I don't believe it is the solution I am looking for. I want to be able to do it in raw SQL; moreover, if a user gives a big chained raw Spark SQL join query, I am not even sure how to make copies of the dataframe to achieve the self-join. Is
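Independent of the plan-deduplication bug tracked in SPARK-23406, what a raw-SQL self-join is expected to compute can be sketched in plain Python (illustrative data and names; this shows the expected result of the query, not the Spark workaround itself):

```python
# Illustrative sketch: what a self-join on a key computes, the result the
# raw SQL "SELECT ... FROM t a JOIN t b ON a.key = b.key" should produce.
# Plain Python with made-up data; not Spark API code.

rows = [("K", 1), ("K", 2), ("J", 3)]

# Every pair of rows from the same table that agree on the key.
self_joined = [(a_key, a_val, b_val)
               for a_key, a_val in rows
               for b_key, b_val in rows
               if a_key == b_key]
# self_joined -> [("K", 1, 1), ("K", 1, 2),
#                 ("K", 2, 1), ("K", 2, 2),
#                 ("J", 3, 3)]
```

The workaround discussed in the thread (materializing two copies of the dataframe) changes how the plan is built, not this result.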

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to this On Tue, Feb 20, 2018 at 7:52 PM, kant kodali wrote: > Hi All, > > I have the following code > > import org.apache.spark.sql.streaming.Trigger > > val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", >