Hi All,

We have a Dataflow job which is consuming from hundreds of Kafka topics. This job has time-based window aggregations. From time to time we see the job dropping data due to lateness at pipeline start; once the job reaches a steady state, there is no data loss. We are using a custom timestamp policy which advances the watermark based on the message's header/payload so that we stay in the event-time domain. Below is my theory for the data loss. I would appreciate your input on whether this is a valid possibility and, if so, whether there are any solutions to this problem.
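To make the setup concrete, here is a simplified, untested sketch of the kind of timestamp policy I mean (not our exact code; the event_ts header name, the fallback to the Kafka record timestamp, and the class name are placeholders):

import java.nio.charset.StandardCharsets;
import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.kafka.common.header.Header;
import org.joda.time.Instant;

/** Assigns event timestamps from a Kafka header and advances the per-partition watermark. */
public class HeaderTimestampPolicy<K, V> extends TimestampPolicy<K, V> {

  private static final String EVENT_TS_HEADER = "event_ts"; // placeholder header name

  private Instant currentWatermark;

  public HeaderTimestampPolicy(Optional<Instant> previousWatermark) {
    // Resume from the checkpointed watermark if there is one, otherwise start at the minimum.
    this.currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
  }

  @Override
  public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<K, V> record) {
    Header header = record.getHeaders().lastHeader(EVENT_TS_HEADER);
    Instant eventTime =
        header != null
            ? Instant.parse(new String(header.value(), StandardCharsets.UTF_8))
            : new Instant(record.getTimestamp()); // fall back to the Kafka record timestamp
    // The partition watermark tracks the maximum event time seen so far on this partition.
    if (eventTime.isAfter(currentWatermark)) {
      currentWatermark = eventTime;
    }
    return eventTime;
  }

  @Override
  public Instant getWatermark(PartitionContext ctx) {
    return currentWatermark;
  }
}

It is wired into the read with something like KafkaIO.read().withTimestampPolicyFactory((tp, prevWatermark) -> new HeaderTimestampPolicy<>(prevWatermark)).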
Say there are n topics, t1, t2, ..., t_withlag, ..., tn, which are being consumed from. All topics have almost zero backlog except for one, t_withlag, whose backlog can be assumed to be substantial enough that it goes beyond the allowed lateness. The job starts:

1) Bundle 1 starts getting created, and let's say the bundle is filled only from the first few topics, so the runner never reaches the laggy topic.
2) The timestamp/watermark policy is invoked for all topic partitions which participated in the above bundle.
3) The runner computes the global watermark based on the above input.
4) Since the laggy topic did not contribute to the watermark calculation, the global watermark is set to ~now().
5) The next bundle starts processing; in this bundle the runner starts encountering messages from t_withlag and hence drops data from those partitions until it is caught up.

Is this scenario possible? If yes, what are possible solutions to such a situation?
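One mitigation I wanted to run by the list, in case the theory holds, is to use the PartitionContext backlog information to keep a partition's watermark held back while that partition still has a meaningful backlog, so a lagging partition never reports a near-now watermark during catch-up. A rough, untested sketch of the idea (the backlog threshold, the hold-back duration, and the class name are placeholders; event-time extraction is elided):

import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Duration;
import org.joda.time.Instant;

/**
 * Sketch: while a partition still has a large backlog, report a watermark held back
 * behind the newest event time read so far, so downstream windows stay open for the
 * unread data; only report the full event-time watermark once the partition has
 * (nearly) caught up.
 */
public class BacklogAwareTimestampPolicy<K, V> extends TimestampPolicy<K, V> {

  private static final long BACKLOG_THRESHOLD = 1000; // placeholder: "caught up enough"
  private static final Duration HOLD_BACK = Duration.standardMinutes(5); // placeholder slack

  private Instant maxEventTime;

  public BacklogAwareTimestampPolicy(Optional<Instant> previousWatermark) {
    this.maxEventTime = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
  }

  @Override
  public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<K, V> record) {
    // Same header/payload-based event-time extraction as above; elided here.
    Instant eventTime = new Instant(record.getTimestamp());
    if (eventTime.isAfter(maxEventTime)) {
      maxEventTime = eventTime;
    }
    return eventTime;
  }

  @Override
  public Instant getWatermark(PartitionContext ctx) {
    if (maxEventTime.equals(BoundedWindow.TIMESTAMP_MIN_VALUE)) {
      return maxEventTime; // nothing read yet on this partition; keep the watermark at the floor
    }
    if (ctx.getMessageBacklog() > BACKLOG_THRESHOLD) {
      // Still catching up: hold the watermark back behind the newest observed event time.
      return maxEventTime.minus(HOLD_BACK);
    }
    return maxEventTime;
  }
}

That said, I am not sure this would help if the runner computes the global watermark before getWatermark() has ever been invoked for the lagging partitions, which is exactly the situation in step 4 above. I also looked at CustomTimestampPolicyWithLimitedDelay, which as I understand it holds the watermark at (last record timestamp - maxDelay), but the startup backlog can exceed any delay we would reasonably configure. Any pointers would be appreciated.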