Where is the watermark for this old data coming from? Rather than
messing with allowed lateness, would it be possible to hold the
watermark back appropriately during the time you're injecting old data
(assuming there's only a finite amount of it)?

On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <car...@mrcalonso.com> wrote:
> Thanks for your responses!!
>
> I have a scenario where I have to reprocess very disordered data for 4 or 5
> years and I don't want to lose any data. I'm thinking of setting a very big
> allowed lateness (5 years), but before doing that I'd like to understand the
> consequences that may have. I guess memory wise will be very consuming as no
> window will ever expire, but I guess I could overcome that with brute force
> (many machines with many RAM) but, are there more concerns I should be aware
> of? This should be a one-off thing.
>
> Thanks!
>
> On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <rang...@google.com> wrote:
>>
>> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek
>> <pawelbartosze...@gmail.com> wrote:
>>>
>>> Hi Raghu,
>>> Can you provide more details about increasing allowed lateness? Even if I
>>> do that I still need to compare event time of record with processing
>>> time(system current time) in my ParDo?
>>
>>
>> I see. PaneInfo() associated with each element has 'Timing' enum, so we
>> can tell if the element is late, but it does not tell how late.
>> How about this : We can have a periodic timer firing every minute and
>> store the scheduled time of the timer in state as the watermark time. We
>> could compare element time to this stored time for good approximation (may
>> require parallel stage with global window, dropping any events that 'clearly
>> within limits' based on current time). There are probably other ways to do
>> this with timers within existing stage.
>>
>>>
>>> Pawel
>>>
>>> On 8 February 2018 at 05:40, Raghu Angadi <rang...@google.com> wrote:
>>>>
>>>> The watermark is not directly available, you essentially infer from
>>>> fired triggers (e.g. fixed windows). I would consider some of these options
>>>> :
>>>>   - Adhoc debugging : if the pipeline is close to realtime, you can just
>>>> guess if a element will be dropped based on its timestamp and current time
>>>> in the first stage (before first aggregation)
>>>>   - Increase allowed lateness (say to 3 days) and drop the elements
>>>> yourself you notice are later than 1 day.
>>>>   - Place the elements into another window with larger allowed lateness
>>>> and log very late elements in another parallel aggregation (see
>>>> TriggerExample.java in Beam repo).
>>>>
>>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <car...@mrcalonso.com>
>>>> wrote:
>>>>>
>>>>> Hi everyone!!
>>>>>
>>>>> I have a streaming job running with fixed windows of one hour and
>>>>> allowed lateness of two days and the number of dropped due to lateness
>>>>> elements is slowly, but continuously growing and I'd like to understand
>>>>> which elements are those.
>>>>>
>>>>> I'd like to get the watermark from inside the job to compare it against
>>>>> each element and write log messages with the ones that will be potentially
>>>>> discarded.... Does that approach make any sense? I which case... How can I
>>>>> get the watermark from inside the job? Any other ideas?
>>>>>
>>>>> Thanks in advance!!
>>>>
>>>>
>>>
>

Reply via email to