I thought of loading all of it into the PubSub subscription before starting the job. That should work, right? Any better suggestion? On Thu, 8 Feb 2018 at 22:23, Carlos Alonso <car...@mrcalonso.com> wrote:
> Yes, the data is finite (although it comes through PubSub, so I guess is > considered unbounded). > How could I hold the watermark and prevent it from moving? > > Thanks! > > On Thu, Feb 8, 2018 at 10:06 PM Robert Bradshaw <rober...@google.com> > wrote: > >> Where is the watermark for this old data coming from? Rather than >> messing with allowed lateness, would it be possible to hold the >> watermark back appropriately during the time you're injecting old data >> (assuming there's only a finite amount of it)? >> >> On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <car...@mrcalonso.com> >> wrote: >> > Thanks for your responses!! >> > >> > I have a scenario where I have to reprocess very disordered data for 4 >> or 5 >> > years and I don't want to lose any data. I'm thinking of setting a very >> big >> > allowed lateness (5 years), but before doing that I'd like to >> understand the >> > consequences that may have. I guess memory wise will be very consuming >> as no >> > window will ever expire, but I guess I could overcome that with brute >> force >> > (many machines with many RAM) but, are there more concerns I should be >> aware >> > of? This should be a one-off thing. >> > >> > Thanks! >> > >> > On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <rang...@google.com> wrote: >> >> >> >> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek >> >> <pawelbartosze...@gmail.com> wrote: >> >>> >> >>> Hi Raghu, >> >>> Can you provide more details about increasing allowed lateness? Even >> if I >> >>> do that I still need to compare event time of record with processing >> >>> time(system current time) in my ParDo? >> >> >> >> >> >> I see. PaneInfo() associated with each element has 'Timing' enum, so we >> >> can tell if the element is late, but it does not tell how late. >> >> How about this : We can have a periodic timer firing every minute and >> >> store the scheduled time of the timer in state as the watermark time. >> We >> >> could compare element time to this stored time for good approximation >> (may >> >> require parallel stage with global window, dropping any events that >> 'clearly >> >> within limits' based on current time). There are probably other ways >> to do >> >> this with timers within existing stage. >> >> >> >>> >> >>> Pawel >> >>> >> >>> On 8 February 2018 at 05:40, Raghu Angadi <rang...@google.com> wrote: >> >>>> >> >>>> The watermark is not directly available, you essentially infer from >> >>>> fired triggers (e.g. fixed windows). I would consider some of these >> options >> >>>> : >> >>>> - Adhoc debugging : if the pipeline is close to realtime, you can >> just >> >>>> guess if a element will be dropped based on its timestamp and >> current time >> >>>> in the first stage (before first aggregation) >> >>>> - Increase allowed lateness (say to 3 days) and drop the elements >> >>>> yourself you notice are later than 1 day. >> >>>> - Place the elements into another window with larger allowed >> lateness >> >>>> and log very late elements in another parallel aggregation (see >> >>>> TriggerExample.java in Beam repo). >> >>>> >> >>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <car...@mrcalonso.com> >> >>>> wrote: >> >>>>> >> >>>>> Hi everyone!! >> >>>>> >> >>>>> I have a streaming job running with fixed windows of one hour and >> >>>>> allowed lateness of two days and the number of dropped due to >> lateness >> >>>>> elements is slowly, but continuously growing and I'd like to >> understand >> >>>>> which elements are those. >> >>>>> >> >>>>> I'd like to get the watermark from inside the job to compare it >> against >> >>>>> each element and write log messages with the ones that will be >> potentially >> >>>>> discarded.... Does that approach make any sense? I which case... >> How can I >> >>>>> get the watermark from inside the job? Any other ideas? >> >>>>> >> >>>>> Thanks in advance!! >> >>>> >> >>>> >> >>> >> > >> >