Yes, the data is finite (although it comes through PubSub, so I guess it is
considered unbounded).
How could I hold the watermark and prevent it from moving?

Thanks!

On Thu, Feb 8, 2018 at 10:06 PM Robert Bradshaw <[email protected]> wrote:

> Where is the watermark for this old data coming from? Rather than
> messing with allowed lateness, would it be possible to hold the
> watermark back appropriately during the time you're injecting old data
> (assuming there's only a finite amount of it)?
>
> On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <[email protected]>
> wrote:
> > Thanks for your responses!!
> >
> > I have a scenario where I have to reprocess very disordered data spanning
> > 4 or 5 years and I don't want to lose any of it. I'm thinking of setting a
> > very big allowed lateness (5 years), but before doing that I'd like to
> > understand the consequences that may have. I guess it will be very
> > memory-consuming, as no window will ever expire, but I guess I could
> > overcome that with brute force (many machines with lots of RAM). Are there
> > more concerns I should be aware of? This should be a one-off thing.
> >
> > Thanks!
> >
> > On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <[email protected]> wrote:
> >>
> >> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek
> >> <[email protected]> wrote:
> >>>
> >>> Hi Raghu,
> >>> Can you provide more details about increasing allowed lateness? Even
> >>> if I do that, would I still need to compare the event time of each
> >>> record with processing time (current system time) in my ParDo?
> >>
> >>
> >> I see. The PaneInfo() associated with each element has a 'Timing' enum,
> >> so we can tell if the element is late, but it does not tell us how late.
> >> How about this: we can have a periodic timer firing every minute and
> >> store the scheduled time of the timer in state as the watermark time. We
> >> could then compare the element time to this stored time for a good
> >> approximation (this may require a parallel stage with a global window,
> >> dropping any events that are 'clearly within limits' based on the current
> >> time). There are probably other ways to do this with timers within the
> >> existing stage.
> >>
> >>>
> >>> Pawel
> >>>
> >>> On 8 February 2018 at 05:40, Raghu Angadi <[email protected]> wrote:
> >>>>
> >>>> The watermark is not directly available; you essentially infer it from
> >>>> fired triggers (e.g. fixed windows). I would consider some of these
> >>>> options:
> >>>>   - Ad hoc debugging: if the pipeline is close to real time, you can
> >>>>     guess whether an element will be dropped based on its timestamp and
> >>>>     the current time in the first stage (before the first aggregation).
> >>>>   - Increase the allowed lateness (say to 3 days) and drop the elements
> >>>>     that you notice are more than 1 day late yourself.
> >>>>   - Place the elements into another window with a larger allowed
> >>>>     lateness and log very late elements in another parallel aggregation
> >>>>     (see TriggerExample.java in the Beam repo).
> >>>>
> >>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> Hi everyone!!
> >>>>>
> >>>>> I have a streaming job running with fixed windows of one hour and an
> >>>>> allowed lateness of two days. The number of elements dropped due to
> >>>>> lateness is slowly but continuously growing, and I'd like to
> >>>>> understand which elements those are.
> >>>>>
> >>>>> I'd like to get the watermark from inside the job to compare it
> >>>>> against each element and write log messages for the ones that will
> >>>>> potentially be discarded. Does that approach make any sense? In which
> >>>>> case, how can I get the watermark from inside the job? Any other ideas?
> >>>>>
> >>>>> Thanks in advance!!
> >>>>
> >>>>
> >>>
> >
>
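
For reference, the comparison discussed in the thread (element time against
an approximate watermark) can be sketched in plain Python. This is only a
model of the drop rule, not Beam API code: the names window_end and
would_be_dropped are hypothetical, the window size and allowed lateness are
the values from the original question, and the watermark is assumed to be
an estimate obtained some other way (for example the timer-based
approximation suggested above). It assumes a window's elements are dropped
once the watermark reaches the end of the window plus the allowed lateness.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the original question in this thread.
WINDOW_SIZE = timedelta(hours=1)
ALLOWED_LATENESS = timedelta(days=2)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_end(event_time, size=WINDOW_SIZE):
    """Exclusive end of the epoch-aligned fixed window containing event_time."""
    offset = (event_time - EPOCH) % size
    return event_time - offset + size


def would_be_dropped(event_time, estimated_watermark,
                     size=WINDOW_SIZE, lateness=ALLOWED_LATENESS):
    """Approximate check: has the element's window already expired?

    Assumption: a window expires once the (estimated) watermark reaches
    window end + allowed lateness. The real runner decides this with its
    own watermark, so treat the result as a hint for logging only.
    """
    return estimated_watermark >= window_end(event_time, size) + lateness
```

A logging step could call would_be_dropped(element_time, stored_watermark)
for each element and emit a message when it returns True.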
