Being able to optionally fire registered processing-time timers at the end of a job would be interesting, and it would help in (at least some of) the cases I have in mind. I don't have a better idea.
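To make the two policies concrete, here is a tiny plain-Java simulation of what "fire at end" vs. "ignore" could mean for timers still pending when a bounded input is exhausted. This is an illustrative sketch only, not Flink API; all names here are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Flink code) of the two end-of-input policies for
// processing-time timers: when the bounded input is exhausted, timers
// registered for a future time can either all fire once, or be
// silently dropped (but never throw).
public class EndOfInputTimerSketch {
    enum Policy { FIRE_AT_END, IGNORE }

    // Returns the timestamps whose timers actually fired.
    static List<Long> drainTimers(List<Long> registeredTimers, long endOfInputTime, Policy policy) {
        List<Long> fired = new ArrayList<>();
        for (long t : registeredTimers) {
            if (t <= endOfInputTime) {
                fired.add(t);     // already due: fires under both policies
            } else if (policy == Policy.FIRE_AT_END) {
                fired.add(t);     // future timer: fired once at close()
            }
            // Policy.IGNORE: future timers are dropped without error
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Long> timers = List.of(10L, 50L, 500L);
        // Input ends at time 100; the timer for 500 is still pending.
        System.out.println(drainTimers(timers, 100L, Policy.FIRE_AT_END)); // [10, 50, 500]
        System.out.println(drainTimers(timers, 100L, Policy.IGNORE));      // [10, 50]
    }
}
```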
David

On Mon, Aug 17, 2020 at 8:24 PM Kostas Kloudas <kklou...@apache.org> wrote:
> Hi Kurt and David,
>
> Thanks a lot for the insightful feedback!
>
> @Kurt: For the topic of checkpointing with batch scheduling, I totally
> agree with you that it requires a lot more work and careful thinking
> about the semantics. This FLIP was written under the assumption that if
> the user wants to have checkpoints on bounded input, he/she will have
> to go with STREAMING as the scheduling mode. Checkpointing for BATCH
> can be handled as a separate topic in the future.
>
> In the case of MIXED workloads, and for this FLIP, the scheduling mode
> should be set to STREAMING. That is why the AUTOMATIC option sets
> scheduling to BATCH only if all the sources are bounded. I am not sure
> what the plans are at the scheduling level, but one could imagine that
> for mixed workloads in the future, we schedule first all the bounded
> subgraphs in BATCH mode and allow only one UNBOUNDED subgraph per
> application, which is scheduled after all bounded ones have finished.
> Essentially, the bounded subgraphs would be used to bootstrap the
> unbounded one. But I am not aware of any concrete plans in that
> direction.
>
> @David: The handling of processing-time timers is a topic that has also
> been discussed in the community in the past, and unfortunately I do not
> remember any final conclusion.
>
> In the current context, and for bounded input, we chose to favor
> reproducibility of the result, as this is expected in batch processing
> where the whole input is available in advance. This is why this
> proposal suggests not allowing processing-time timers.
> But I understand your argument that the user may want to be able to run
> the same pipeline on batch and streaming; this is why we added the two
> options under future work, namely (from the FLIP):
>
> ```
> Future Work: In the future we may consider adding as options the
> capability of:
> * firing all the registered processing time timers at the end of a job
>   (at close()) or,
> * ignoring all the registered processing time timers at the end of a job.
> ```
>
> Conceptually, we are saying that batch execution is assumed to be
> instantaneous and to refer to a single "point" in time, and any
> processing-time timers set for the future may either fire at the end of
> execution or be ignored (but not throw an exception). I could also see
> ignoring the timers in batch as the default, if this makes more sense.
>
> By the way, do you have any use cases in mind that would help us better
> shape our processing-time timer handling?
>
> Kostas
>
> On Mon, Aug 17, 2020 at 2:52 PM David Anderson <da...@alpinegizmo.com> wrote:
> >
> > Kostas,
> >
> > I'm pleased to see some concrete details in this FLIP.
> >
> > I wonder if the current proposal goes far enough in the direction of
> > recognizing the need some users may have for "batch" and "bounded
> > streaming" to be treated differently. If I've understood it correctly,
> > the section on scheduling allows me to choose STREAMING scheduling even
> > if I have bounded sources. I like that approach, because it recognizes
> > that even though I have bounded inputs, I don't necessarily want batch
> > processing semantics. I think it makes sense to extend this idea to
> > processing-time support as well.
> >
> > My thinking is that sometimes in development and testing it's
> > reasonable to run exactly the same job as in production, except with
> > different sources and sinks.
> > While it might be a reasonable default, I'm not convinced that
> > switching a processing-time streaming job to read from a bounded
> > source should always cause it to fail.
> >
> > David
> >
> > On Wed, Aug 12, 2020 at 5:22 PM Kostas Kloudas <kklou...@apache.org> wrote:
> >>
> >> Hi all,
> >>
> >> As described in FLIP-131 [1], we are aiming at deprecating the DataSet
> >> API in favour of the DataStream API and the Table API. After this work
> >> is done, the user will be able to write a program using the DataStream
> >> API and it will execute efficiently on both bounded and unbounded
> >> data. But before we reach this point, it is worth discussing and
> >> agreeing on the semantics of some operations as we transition from the
> >> streaming world to the batch one.
> >>
> >> This thread and the associated FLIP [2] aim at discussing these issues,
> >> as these topics are pretty important to users and can lead to
> >> unpleasant surprises if we do not pay attention.
> >>
> >> Let's have a healthy discussion here and I will be updating the FLIP
> >> accordingly.
> >>
> >> Cheers,
> >> Kostas
> >>
> >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158871522
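For reference, the AUTOMATIC scheduling decision described earlier in the thread ("BATCH only if all the sources are bounded") boils down to a one-liner. The sketch below is plain Java with made-up names, not the actual FLIP or Flink API:

```java
import java.util.List;

// Illustrative sketch (not Flink code) of the AUTOMATIC mode decision:
// choose BATCH scheduling only when every source is bounded, otherwise
// fall back to STREAMING.
public class SchedulingModeSketch {
    enum SchedulingMode { BATCH, STREAMING }

    static SchedulingMode decide(List<Boolean> sourceIsBounded) {
        // BATCH is safe only if *all* sources are bounded.
        boolean allBounded = sourceIsBounded.stream().allMatch(b -> b);
        return allBounded ? SchedulingMode.BATCH : SchedulingMode.STREAMING;
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of(true, true)));   // BATCH
        System.out.println(decide(List.of(true, false)));  // STREAMING: one unbounded source forces streaming
    }
}
```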