Got the point. If you would like to get "correct" output, you may need to
set global watermark as "min", because watermark is not only used for
evicting rows in state, but also discarding input rows later than
watermark. Here you may want to be aware that there're two stateful
operators which will receive inputs from previous stage and discard them
via watermark before processing.

Btw, you may also need to consider the difference of the concept of
watermark between Spark and others:

1. Spark uses high watermark (picks highest event timestamp of input rows)
even for single watermark whereas other frameworks use low watermark (picks
lowest event timestamp of input rows). So you may always need to set enough
delay on watermark.

2. Spark uses global watermark whereas other frameworks normally use
operator-wise watermark. This is limitation of Spark (given outputs of
previous stateful operator will become inputs of next stateful operator,
they should have different watermark) and one of contributor proposes the
approach [1] which would fit for Spark (unfortunately it haven't been
reviewed by committers so long).

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://github.com/apache/spark/pull/23576

On Tue, Jun 11, 2019 at 7:06 AM Joe Ammann <j...@pyx.ch> wrote:

> Hi all
>
> it took me some time to get the issues extracted into a piece of
> standalone code. I created the following gist
>
> https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
>
> I has messages for 4 topics A/B/C/D and a simple Python program which
> shows 6 use cases, with my expectations and observations with Spark 2.4.3
>
> It would be great if you could have a look and check if I'm doing
> something wrong, or this is indeed a limitation of Spark?
>
> On 6/5/19 5:35 PM, Jungtaek Lim wrote:
> > Nice to hear you're investigating the issue deeply.
> >
> > Btw, if attaching code is not easy, maybe you could share
> logical/physical plan on any batch: "detail" in SQL tab would show up the
> plan as string. Plans from sequential batches would be much helpful - and
> streaming query status in these batch (especially watermark) should be
> helpful too.
> >
>
>
> --
> CU, Joe
>


-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Reply via email to