Being able to optionally fire registered processing-time timers at the end of a job would be interesting, and it would help in (at least some of) the cases I have in mind. I don't have a better idea.
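To make the two policies concrete, here is a tiny plain-Java simulation of what "fire at end" vs. "ignore" could mean for timers still pending when a bounded input is exhausted. This is an illustrative sketch only, not Flink API; all names here are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Flink code) of the two end-of-input policies for
// processing-time timers: when the bounded input is exhausted, timers
// registered for a future time can either all fire once, or be
// silently dropped (but never throw).
public class EndOfInputTimerSketch {
    enum Policy { FIRE_AT_END, IGNORE }

    // Returns the timestamps whose timers actually fired.
    static List<Long> drainTimers(List<Long> registeredTimers, long endOfInputTime, Policy policy) {
        List<Long> fired = new ArrayList<>();
        for (long t : registeredTimers) {
            if (t <= endOfInputTime) {
                fired.add(t);     // already due: fires under both policies
            } else if (policy == Policy.FIRE_AT_END) {
                fired.add(t);     // future timer: fired once at close()
            }
            // Policy.IGNORE: future timers are dropped without error
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Long> timers = List.of(10L, 50L, 500L);
        // Input ends at time 100; the timer for 500 is still pending.
        System.out.println(drainTimers(timers, 100L, Policy.FIRE_AT_END)); // [10, 50, 500]
        System.out.println(drainTimers(timers, 100L, Policy.IGNORE));      // [10, 50]
    }
}
```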
David

On Mon, Aug 17, 2020 at 8:24 PM Kostas Kloudas <kklou...@apache.org> wrote:
> Hi Kurt and David,
>
> Thanks a lot for the insightful feedback!
>
> @Kurt: For the topic of checkpointing with batch scheduling, I totally
> agree with you that it requires a lot more work and careful thinking
> about the semantics. This FLIP was written under the assumption that if
> the user wants to have checkpoints on bounded input, he/she will have
> to go with STREAMING as the scheduling mode. Checkpointing for BATCH
> can be handled as a separate topic in the future.
>
> In the case of MIXED workloads, and for this FLIP, the scheduling mode
> should be set to STREAMING. That is why the AUTOMATIC option sets
> scheduling to BATCH only if all the sources are bounded. I am not sure
> what the plans are at the scheduling level, but one could imagine that
> for mixed workloads in the future, we schedule first all the bounded
> subgraphs in BATCH mode and allow only one UNBOUNDED subgraph per
> application, which is scheduled after all bounded ones have finished.
> Essentially, the bounded subgraphs would be used to bootstrap the
> unbounded one. But I am not aware of any concrete plans in that
> direction.
>
> @David: The handling of processing-time timers is a topic that has also
> been discussed in the community in the past, and unfortunately I do not
> remember any final conclusion.
>
> In the current context, and for bounded input, we chose to favor
> reproducibility of the result, as this is expected in batch processing
> where the whole input is available in advance. This is why this
> proposal suggests not allowing processing-time timers.
> But I understand your argument that the user may want to be able to run
> the same pipeline on batch and streaming; this is why we added the two
> options under future work, namely (from the FLIP):
>
> ```
> Future Work: In the future we may consider adding as options the
> capability of:
> * firing all the registered processing time timers at the end of a job
>   (at close()) or,
> * ignoring all the registered processing time timers at the end of a job.
> ```
>
> Conceptually, we are saying that batch execution is assumed to be
> instantaneous and to refer to a single "point" in time, and any
> processing-time timers set for the future may either fire at the end of
> execution or be ignored (but not throw an exception). I could also see
> ignoring the timers in batch as the default, if this makes more sense.
>
> By the way, do you have any use cases in mind that would help us better
> shape our processing-time timer handling?
>
> Kostas
>
> On Mon, Aug 17, 2020 at 2:52 PM David Anderson <da...@alpinegizmo.com> wrote:
> >
> > Kostas,
> >
> > I'm pleased to see some concrete details in this FLIP.
> >
> > I wonder if the current proposal goes far enough in the direction of
> > recognizing the need some users may have for "batch" and "bounded
> > streaming" to be treated differently. If I've understood it correctly,
> > the section on scheduling allows me to choose STREAMING scheduling even
> > if I have bounded sources. I like that approach, because it recognizes
> > that even though I have bounded inputs, I don't necessarily want batch
> > processing semantics. I think it makes sense to extend this idea to
> > processing-time support as well.
> >
> > My thinking is that sometimes in development and testing it's
> > reasonable to run exactly the same job as in production, except with
> > different sources and sinks.
> > While it might be a reasonable default, I'm not convinced that
> > switching a processing-time streaming job to read from a bounded
> > source should always cause it to fail.
> >
> > David
> >
> > On Wed, Aug 12, 2020 at 5:22 PM Kostas Kloudas <kklou...@apache.org> wrote:
> >>
> >> Hi all,
> >>
> >> As described in FLIP-131 [1], we are aiming at deprecating the DataSet
> >> API in favour of the DataStream API and the Table API. After this work
> >> is done, the user will be able to write a program using the DataStream
> >> API and it will execute efficiently on both bounded and unbounded
> >> data. But before we reach this point, it is worth discussing and
> >> agreeing on the semantics of some operations as we transition from the
> >> streaming world to the batch one.
> >>
> >> This thread and the associated FLIP [2] aim at discussing these issues,
> >> as these topics are pretty important to users and can lead to
> >> unpleasant surprises if we do not pay attention.
> >>
> >> Let's have a healthy discussion here and I will be updating the FLIP
> >> accordingly.
> >>
> >> Cheers,
> >> Kostas
> >>
> >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158871522
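For reference, the AUTOMATIC scheduling decision described earlier in the thread ("BATCH only if all the sources are bounded") boils down to a one-liner. The sketch below is plain Java with made-up names, not the actual FLIP or Flink API:

```java
import java.util.List;

// Illustrative sketch (not Flink code) of the AUTOMATIC mode decision:
// choose BATCH scheduling only when every source is bounded, otherwise
// fall back to STREAMING.
public class SchedulingModeSketch {
    enum SchedulingMode { BATCH, STREAMING }

    static SchedulingMode decide(List<Boolean> sourceIsBounded) {
        // BATCH is safe only if *all* sources are bounded.
        boolean allBounded = sourceIsBounded.stream().allMatch(b -> b);
        return allBounded ? SchedulingMode.BATCH : SchedulingMode.STREAMING;
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of(true, true)));   // BATCH
        System.out.println(decide(List.of(true, false)));  // STREAMING: one unbounded source forces streaming
    }
}
```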