[GitHub] spark issue #21890: [SPARK-24932] Allow update mode for streaming queries wi...

2018-08-08 Thread fuyufjh
Github user fuyufjh commented on the issue:

https://github.com/apache/spark/pull/21890
  
Hi @attilapiros, before I go on adding test cases (I am actually doing 
this right now), I think there is one more thing that needs to be figured out first.

As this PR describes, I want to support stream-stream joins in update mode by 
making them behave exactly as they do in append mode. This is fine for inner 
joins, but not so straightforward for outer joins.

For example:

Assume the watermark delay is set to 10 minutes and we run a query like `A left 
outer join B`, where event `A1` arrives at 10:01 and event `B1` arrives at 10:02. 
In append mode, of course, `A1` waits for `B1`, and the join result 
`A1-B1` is produced at 10:02. 

However, in update mode we can either keep this behavior ***OR*** take the 
following actions, which also look reasonable but are somewhat costly:

1. Emit an `A1-null` at 10:01.
2. Emit an `A1-B1` at 10:02 when `B1` appears, and expect the data sink to 
overwrite the previous result.

So which is the *correct* behavior: the same as append mode, or the 
two-step emission described in the list? Please let me know your opinion.
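
To make the two options concrete, here is a minimal sketch in plain Python. It is not Spark code: `simulate_left_outer`, `apply_upserts`, and the event tuples are hypothetical names for illustration only, and the watermark-driven emission of never-matched rows is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple

# (arrival time, side "A" or "B", row name, join key) -- illustrative only
Event = Tuple[str, str, str, int]
# (emit time, left row name, right row name or None)
Row = Tuple[str, str, Optional[str]]


def simulate_left_outer(events: List[Event], eager_nulls: bool) -> List[Row]:
    """Simulate `A left outer join B` on a shared key.

    eager_nulls=False: append-like behavior, a row is emitted only on a match.
    eager_nulls=True: the alternative, emit `A-null` at once, then the match.
    Watermark expiry of unmatched left rows is not modeled here.
    """
    left_seen: Dict[int, str] = {}   # join key -> buffered left row
    right_seen: Dict[int, str] = {}  # join key -> buffered right row
    emitted: List[Row] = []
    for time, side, name, key in events:
        if side == "A":
            left_seen[key] = name
            if key in right_seen:
                emitted.append((time, name, right_seen[key]))
            elif eager_nulls:
                emitted.append((time, name, None))
        else:
            right_seen[key] = name
            if key in left_seen:
                emitted.append((time, left_seen[key], name))
    return emitted


def apply_upserts(rows: List[Row]) -> Dict[str, Optional[str]]:
    """Model a sink that overwrites earlier results, keyed by the left row."""
    state: Dict[str, Optional[str]] = {}
    for _, left, right in rows:
        state[left] = right
    return state


EVENTS: List[Event] = [("10:01", "A", "A1", 1), ("10:02", "B", "B1", 1)]

append_like = simulate_left_outer(EVENTS, eager_nulls=False)
eager = simulate_left_outer(EVENTS, eager_nulls=True)

print(append_like)  # [('10:02', 'A1', 'B1')]
print(eager)        # [('10:01', 'A1', None), ('10:02', 'A1', 'B1')]
print(apply_upserts(append_like) == apply_upserts(eager))  # True
```

Both variants leave an overwriting sink in the same final state; the second one costs an extra write per left row and only works if the sink can upsert.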

cc. @jose-torres @tdas 




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21890: [SPARK-24932] Allow update mode for streaming que...

2018-07-26 Thread fuyufjh
GitHub user fuyufjh opened a pull request:

https://github.com/apache/spark/pull/21890

[SPARK-24932] Allow update mode for streaming queries with join

## What changes were proposed in this pull request?

In SPARK-19140 we added support for update output mode in non-aggregation 
streaming queries. This should also apply to streaming joins to keep the 
semantics consistent.

Note: the streaming join feature was added after SPARK-19140. 

In *update* output mode, the join works exactly as it does in *append* 
mode. Allowing it nevertheless lets users run, for example, an 
aggregation-after-join query in *update* mode to get more timely 
output.
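
As a rough illustration of why update mode gives more timely output for an aggregation after a join, here is a simplified model in plain Python. It is not Spark code: the function names and the batches of join keys are hypothetical, and the append-mode side assumes the window is finalized only after the last batch.

```python
from collections import Counter
from typing import Dict, List


def update_mode_outputs(batches: List[List[str]]) -> List[Dict[str, int]]:
    """After each micro-batch, emit the new count for every key that changed."""
    totals: Counter = Counter()
    outputs: List[Dict[str, int]] = []
    for batch in batches:
        totals.update(batch)
        outputs.append({key: totals[key] for key in sorted(set(batch))})
    return outputs


def append_mode_outputs(batches: List[List[str]]) -> List[Dict[str, int]]:
    """Emit nothing per batch; emit final counts once the window is finalized.

    Assumption: finalization happens after the last batch.
    """
    totals: Counter = Counter()
    outputs: List[Dict[str, int]] = []
    for batch in batches:
        totals.update(batch)
        outputs.append({})
    outputs.append(dict(totals))
    return outputs


# Each inner list holds the group keys of joined rows in one micro-batch.
joined_batches = [["k1", "k2"], ["k1"]]

print(update_mode_outputs(joined_batches))  # [{'k1': 1, 'k2': 1}, {'k1': 2}]
print(append_mode_outputs(joined_batches))  # [{}, {}, {'k1': 2, 'k2': 1}]
```

In this model the update-mode consumer sees a running count after every batch, while the append-mode consumer sees nothing until the window closes.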

## How was this patch tested?

See changes in UnsupportedOperationsSuite.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fuyufjh/spark SPARK-19140-allow-update-for-stream-join

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21890.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21890


commit f2daf62c4e5d9daf397fc804ed9365204933ddbd
Author: Eric Fu 
Date:   2018-07-27T02:55:17Z

[SPARK-24932] Allow update mode for streaming queries with join



