Can you explain more about your custom watermark policy?

On Sun, Aug 24, 2025 at 11:43 AM gaurav mishra <[email protected]> wrote:
> Hi All,
> We have a Dataflow job which is consuming from 100s of Kafka topics. This
> job has time-based window aggregations. From time to time we see the job
> dropping data due to lateness at pipeline start. Once the job reaches a
> steady state, there is no data loss. We are using a custom timestamp policy
> which moves the watermark based on the message's header/payload to stay
> in the event_time domain.
> For the data loss, the following is my theory. I need your input on whether
> it is a valid possibility and, if so, whether there are any solutions to
> this problem:
>
> Say there are n topics - t1, t2, ..., t_withlag, ..., tn - which are being
> consumed from.
> All topics have almost zero backlog except for one, t_withlag.
> The backlog can be assumed to be substantial, so that it goes beyond the
> allowed lateness.
>
> Job starts:
> 1) Bundle 1 starts getting created, and let's say the bundle was filled
> only from the first few topics, so the runner could not reach the laggy
> topic.
> 2) The watermark policy is invoked for all topic partitions which
> participated in the above bundle.
> 3) The runner computes the global watermark based on the above input.
> 4) Since the laggy topic did not contribute to the watermark calculation,
> the global watermark is set to ~now().
> 5) The next bundle starts processing, and in this bundle the runner starts
> encountering messages from t_withlag and hence drops data from those
> partitions until it is caught up.
>
> Is this scenario possible? If yes, what are possible solutions to such a
> situation?
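The five-step scenario in the quoted mail can be sketched as a toy simulation. This is plain Python, not Beam's actual watermark machinery, and all names (partitions, the min-over-observed-partitions rule, the lateness check) are illustrative assumptions about how such a policy could behave:

```python
from datetime import datetime, timedelta, timezone

NOW = datetime(2025, 8, 24, 12, 0, tzinfo=timezone.utc)
ALLOWED_LATENESS = timedelta(minutes=10)

# Per-partition event-time watermarks, advanced only when a bundle
# actually reads from that partition (mirroring a policy driven by
# the message header/payload).
partition_watermarks = {}

def observe_bundle(records):
    """Advance each observed partition's watermark to its max event time."""
    for partition, event_time in records:
        current = partition_watermarks.get(partition)
        if current is None or event_time > current:
            partition_watermarks[partition] = event_time

def global_watermark():
    """Min over the partitions that have reported so far.
    The flaw modeled here: partitions never observed simply
    do not participate, so they cannot hold the watermark back."""
    return min(partition_watermarks.values())

# Bundle 1: only up-to-date topics t1..t3 are read; t_withlag is skipped.
observe_bundle([("t1-0", NOW), ("t2-0", NOW), ("t3-0", NOW)])
wm = global_watermark()  # ~now(), since t_withlag contributed nothing

# Bundle 2: the laggy topic finally yields old, backlogged records.
backlog_event_time = NOW - timedelta(hours=2)
dropped = backlog_event_time < wm - ALLOWED_LATENESS
print(dropped)  # True: the backlog is beyond allowed lateness and is dropped
```

Under these assumptions the backlogged record is behind the already-advanced global watermark by more than the allowed lateness, so it is dropped exactly as step 5 describes.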
