Hi Carlos, I wanted to mention that elements are not discarded by comparison of their timestamp with the watermark. They are discarded when their window has expired - when the watermark is passed the end of the window plus allowed lateness. An element just has to that arrive for aggregation - such as GBK or Combine or stateful ParDo - before the window is expired. This drops a lot less data and is a bit less arbitrary than comparing element timestamp versus watermmark.
I wish I had a better answer for how to gain insights into the data being dropped. That's something we should think about. Kenn On Wed, Feb 7, 2018 at 10:30 PM, Carlos Alonso <car...@mrcalonso.com> wrote: > Cool, I'll try some of these. Thanks Raghu! > > On Thu, Feb 8, 2018 at 6:40 AM Raghu Angadi <rang...@google.com> wrote: > >> The watermark is not directly available, you essentially infer from fired >> triggers (e.g. fixed windows). I would consider some of these options : >> - Adhoc debugging : if the pipeline is close to realtime, you can just >> guess if a element will be dropped based on its timestamp and current time >> in the first stage (before first aggregation) >> - Increase allowed lateness (say to 3 days) and drop the elements >> yourself you notice are later than 1 day. >> - Place the elements into another window with larger allowed lateness >> and log very late elements in another parallel aggregation (see >> TriggerExample.java in Beam repo). >> >> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <car...@mrcalonso.com> >> wrote: >> >>> Hi everyone!! >>> >>> I have a streaming job running with fixed windows of one hour and >>> allowed lateness of two days and the number of dropped due to lateness >>> elements is slowly, but continuously growing and I'd like to understand >>> which elements are those. >>> >>> I'd like to get the watermark from inside the job to compare it against >>> each element and write log messages with the ones that will be potentially >>> discarded.... Does that approach make any sense? I which case... How can I >>> get the watermark from inside the job? Any other ideas? >>> >>> Thanks in advance!! >>> >> >>