Hi All,
We have a Dataflow job which consumes from hundreds of Kafka topics. The
job has time-based window aggregations. From time to time we see the job
dropping data due to lateness at pipeline start; once the job reaches a
steady state, there is no data loss. We are using a custom timestamp policy
which advances the watermark based on each message's header/payload so that
we stay in the event-time domain.
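
For concreteness, here is a minimal sketch of the kind of per-partition
TimestampPolicy I mean. The header name "event_time" and the fallback to the
Kafka record timestamp are illustrative, not our exact implementation:

import java.nio.charset.StandardCharsets;
import java.util.Optional;

import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.kafka.common.header.Header;
import org.joda.time.Instant;

public class HeaderEventTimePolicy<K, V> extends TimestampPolicy<K, V> {

  // Max event time observed so far on this partition.
  private Instant maxEventTime;

  public HeaderEventTimePolicy(Optional<Instant> previousWatermark) {
    // Resume from the previous watermark if the reader is restarted.
    this.maxEventTime = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
  }

  @Override
  public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<K, V> record) {
    // Pull the event time from a Kafka header (header name assumed here);
    // fall back to the Kafka record timestamp if the header is missing.
    Header header = record.getHeaders().lastHeader("event_time");
    Instant eventTime =
        header != null
            ? Instant.parse(new String(header.value(), StandardCharsets.UTF_8))
            : new Instant(record.getTimestamp());
    if (maxEventTime.isBefore(eventTime)) {
      maxEventTime = eventTime;
    }
    return eventTime;
  }

  @Override
  public Instant getWatermark(PartitionContext ctx) {
    // Report the partition watermark as the max event time seen so far.
    return maxEventTime;
  }
}

Such a policy would be wired in via KafkaIO.Read#withTimestampPolicyFactory,
e.g. (tp, previousWatermark) -> new HeaderEventTimePolicy<>(previousWatermark).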
Below is my theory for the data loss. I need your input on whether it is a
valid possibility and, if possible, whether there are any solutions to this
problem -

Say there are n topics being consumed: t1, t2, ..., t_withlag, ..., tn.
All topics have almost zero backlog except for one, t_withlag.
Its backlog can be assumed to be substantial enough that its data falls
beyond the allowed lateness.
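
(For reference, the aggregation is along these lines; the window size and
allowed lateness values are illustrative, not our real config. Once the
watermark passes window end + allowed lateness, further elements for that
window are dropped as late data.)

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingSketch {
  // Illustrative time-based windowed aggregation with allowed lateness.
  static PCollection<KV<String, Long>> windowedCounts(PCollection<KV<String, String>> input) {
    return input
        .apply(
            Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(5)))
                .triggering(AfterWatermark.pastEndOfWindow())
                .withAllowedLateness(Duration.standardMinutes(10))
                .discardingFiredPanes())
        .apply(Count.perKey());
  }
}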

Job starts -
1) Bundle 1 starts getting created, and let's say the bundle is filled only
from the first few topics, so the runner never reaches the laggy topic.
2) The timestamp policy is invoked for all topic partitions which
participated in the above bundle.
3) The runner computes the global watermark based on the above input (see
the toy sketch after this list).
4) Since the laggy topic did not contribute to the watermark calculation,
the global watermark is set to ~now().
5) The next bundle starts processing; in this bundle the runner starts
encountering messages from t_withlag and drops data from those partitions
as late until the topic is caught up.
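
To make steps 3 and 4 concrete, here is a toy model of my theory (not actual
runner code): the global watermark is taken as the minimum over the
per-partition watermarks that have been reported so far, so a partition that
has not been read yet cannot hold it back. Partition names and values are
made up.

import java.util.HashMap;
import java.util.Map;
import org.joda.time.Instant;

public class WatermarkTheoryDemo {
  public static void main(String[] args) {
    // Per-partition watermarks reported after bundle 1.
    Map<String, Instant> reported = new HashMap<>();
    // Only the caught-up topics were read; their watermarks are ~now.
    reported.put("t1-0", Instant.now());
    reported.put("t2-0", Instant.now());
    // t_withlag-0 has not been read yet, so it has no entry at all.

    // Min over the *reported* watermarks -> jumps straight to ~now().
    Instant globalWatermark =
        reported.values().stream()
            .min(Instant::compareTo)
            .orElse(new Instant(Long.MIN_VALUE));
    System.out.println("Global watermark after bundle 1: " + globalWatermark);

    // Bundle 2: t_withlag finally reports an old event-time watermark, but
    // the watermark is monotonic and cannot move backwards, so its old
    // records are now beyond the allowed lateness and get dropped.
  }
}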

Is this scenario possible? If yes, what are possible solutions to such a
situation?
