Hi Carlos,

I wanted to mention that elements are not discarded by comparing their
timestamp with the watermark. They are discarded when their window has
expired, that is, when the watermark has passed the end of the window plus
the allowed lateness. An element just has to arrive at the aggregation (such
as GBK, Combine, or a stateful ParDo) before its window expires. This drops a
lot less data and is a bit less arbitrary than comparing the element
timestamp against the watermark.
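
To make the difference concrete, here is a rough sketch in plain Python (not
the Beam API) of the two rules, assuming your setup of one-hour fixed windows
and two days of allowed lateness. The function names are made up for
illustration, and the exact boundary handling in Beam may differ slightly from
this simplified model.

```python
from datetime import datetime, timedelta, timezone

# Assumed settings, taken from the job described in this thread.
WINDOW_SIZE = timedelta(hours=1)
ALLOWED_LATENESS = timedelta(days=2)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_end(element_ts, size=WINDOW_SIZE):
    """End of the epoch-aligned fixed window that contains element_ts."""
    start = element_ts - (element_ts - EPOCH) % size
    return start + size


def window_expired(element_ts, watermark, size=WINDOW_SIZE,
                   lateness=ALLOWED_LATENESS):
    """Simplified model of the actual rule: the element is droppable only
    once the watermark has passed window end + allowed lateness."""
    return watermark > window_end(element_ts, size) + lateness


def naive_too_late(element_ts, watermark, lateness=ALLOWED_LATENESS):
    """The rule that is NOT used: comparing the element timestamp
    directly against the watermark."""
    return watermark - element_ts > lateness


# Hypothetical element and watermark values, for illustration only.
element_ts = datetime(2018, 2, 5, 10, 15, tzinfo=timezone.utc)
watermark = datetime(2018, 2, 7, 10, 30, tzinfo=timezone.utc)

print(naive_too_late(element_ts, watermark))   # timestamp-based rule
print(window_expired(element_ts, watermark))   # window-based rule
```

With these values the element is more than two days behind the watermark, yet
its window (10:00 to 11:00 on Feb 5) does not expire until the watermark passes
11:00 on Feb 7, so the window-based rule still accepts it.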

I wish I had a better answer for how to gain insights into the data being
dropped. That's something we should think about.

Kenn

On Wed, Feb 7, 2018 at 10:30 PM, Carlos Alonso <car...@mrcalonso.com> wrote:

> Cool, I'll try some of these. Thanks Raghu!
>
> On Thu, Feb 8, 2018 at 6:40 AM Raghu Angadi <rang...@google.com> wrote:
>
>> The watermark is not directly available; you essentially infer it from
>> fired triggers (e.g. fixed windows). I would consider some of these options:
>>   - Ad hoc debugging: if the pipeline is close to real time, you can just
>> guess whether an element will be dropped based on its timestamp and the
>> current time in the first stage (before the first aggregation).
>>   - Increase the allowed lateness (say to 3 days) and drop the elements
>> that you notice are later than 1 day yourself.
>>   - Place the elements into another window with a larger allowed lateness
>> and log very late elements in another parallel aggregation (see
>> TriggerExample.java in the Beam repo).
>>
>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <car...@mrcalonso.com>
>> wrote:
>>
>>> Hi everyone!!
>>>
>>> I have a streaming job running with fixed windows of one hour and an
>>> allowed lateness of two days. The number of elements dropped due to
>>> lateness is slowly but continuously growing, and I'd like to understand
>>> which elements those are.
>>>
>>> I'd like to get the watermark from inside the job, compare it against
>>> each element, and write log messages for the ones that will potentially
>>> be discarded. Does that approach make any sense? In which case, how can I
>>> get the watermark from inside the job? Any other ideas?
>>>
>>> Thanks in advance!!
>>>
>>
>>
