Yes, the data is finite (although it comes through PubSub, so I guess is considered unbounded). How could I hold the watermark and prevent it from moving?
Thanks! On Thu, Feb 8, 2018 at 10:06 PM Robert Bradshaw <[email protected]> wrote: > Where is the watermark for this old data coming from? Rather than > messing with allowed lateness, would it be possible to hold the > watermark back appropriately during the time you're injecting old data > (assuming there's only a finite amount of it)? > > On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <[email protected]> > wrote: > > Thanks for your responses!! > > > > I have a scenario where I have to reprocess very disordered data for 4 > or 5 > > years and I don't want to lose any data. I'm thinking of setting a very > big > > allowed lateness (5 years), but before doing that I'd like to understand > the > > consequences that may have. I guess memory wise will be very consuming > as no > > window will ever expire, but I guess I could overcome that with brute > force > > (many machines with many RAM) but, are there more concerns I should be > aware > > of? This should be a one-off thing. > > > > Thanks! > > > > On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <[email protected]> wrote: > >> > >> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek > >> <[email protected]> wrote: > >>> > >>> Hi Raghu, > >>> Can you provide more details about increasing allowed lateness? Even > if I > >>> do that I still need to compare event time of record with processing > >>> time(system current time) in my ParDo? > >> > >> > >> I see. PaneInfo() associated with each element has 'Timing' enum, so we > >> can tell if the element is late, but it does not tell how late. > >> How about this : We can have a periodic timer firing every minute and > >> store the scheduled time of the timer in state as the watermark time. We > >> could compare element time to this stored time for good approximation > (may > >> require parallel stage with global window, dropping any events that > 'clearly > >> within limits' based on current time). There are probably other ways to > do > >> this with timers within existing stage. > >> > >>> > >>> Pawel > >>> > >>> On 8 February 2018 at 05:40, Raghu Angadi <[email protected]> wrote: > >>>> > >>>> The watermark is not directly available, you essentially infer from > >>>> fired triggers (e.g. fixed windows). I would consider some of these > options > >>>> : > >>>> - Adhoc debugging : if the pipeline is close to realtime, you can > just > >>>> guess if a element will be dropped based on its timestamp and current > time > >>>> in the first stage (before first aggregation) > >>>> - Increase allowed lateness (say to 3 days) and drop the elements > >>>> yourself you notice are later than 1 day. > >>>> - Place the elements into another window with larger allowed > lateness > >>>> and log very late elements in another parallel aggregation (see > >>>> TriggerExample.java in Beam repo). > >>>> > >>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <[email protected]> > >>>> wrote: > >>>>> > >>>>> Hi everyone!! > >>>>> > >>>>> I have a streaming job running with fixed windows of one hour and > >>>>> allowed lateness of two days and the number of dropped due to > lateness > >>>>> elements is slowly, but continuously growing and I'd like to > understand > >>>>> which elements are those. > >>>>> > >>>>> I'd like to get the watermark from inside the job to compare it > against > >>>>> each element and write log messages with the ones that will be > potentially > >>>>> discarded.... Does that approach make any sense? I which case... How > can I > >>>>> get the watermark from inside the job? Any other ideas? > >>>>> > >>>>> Thanks in advance!! > >>>> > >>>> > >>> > > >
