What are the advantages of holding back watermark over allowed lateness (duration of both being approximately same)?
On Thu, Feb 8, 2018 at 1:38 PM, Robert Bradshaw <[email protected]> wrote: > You can set the timestamp attribute of your pubsub messages which will > hold back the watermark, see > > https://beam.apache.org/documentation/sdks/javadoc/2. > 2.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html# > withTimestampAttribute-java.lang.String- > > However, if you're mixing historical and live data, it may make more > sense to read these as two separate sources (e.g. the static data from > a set of files, the live data from pubsub) and then flatten them for > further processing. > > On Thu, Feb 8, 2018 at 1:23 PM, Carlos Alonso <[email protected]> > wrote: > > Yes, the data is finite (although it comes through PubSub, so I guess is > > considered unbounded). > > How could I hold the watermark and prevent it from moving? > > > > Thanks! > > > > On Thu, Feb 8, 2018 at 10:06 PM Robert Bradshaw <[email protected]> > wrote: > >> > >> Where is the watermark for this old data coming from? Rather than > >> messing with allowed lateness, would it be possible to hold the > >> watermark back appropriately during the time you're injecting old data > >> (assuming there's only a finite amount of it)? > >> > >> On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <[email protected]> > >> wrote: > >> > Thanks for your responses!! > >> > > >> > I have a scenario where I have to reprocess very disordered data for 4 > >> > or 5 > >> > years and I don't want to lose any data. I'm thinking of setting a > very > >> > big > >> > allowed lateness (5 years), but before doing that I'd like to > understand > >> > the > >> > consequences that may have. I guess memory wise will be very consuming > >> > as no > >> > window will ever expire, but I guess I could overcome that with brute > >> > force > >> > (many machines with many RAM) but, are there more concerns I should be > >> > aware > >> > of? This should be a one-off thing. > >> > > >> > Thanks! > >> > > >> > On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <[email protected]> > wrote: > >> >> > >> >> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek > >> >> <[email protected]> wrote: > >> >>> > >> >>> Hi Raghu, > >> >>> Can you provide more details about increasing allowed lateness? Even > >> >>> if I > >> >>> do that I still need to compare event time of record with processing > >> >>> time(system current time) in my ParDo? > >> >> > >> >> > >> >> I see. PaneInfo() associated with each element has 'Timing' enum, so > we > >> >> can tell if the element is late, but it does not tell how late. > >> >> How about this : We can have a periodic timer firing every minute and > >> >> store the scheduled time of the timer in state as the watermark time. > >> >> We > >> >> could compare element time to this stored time for good approximation > >> >> (may > >> >> require parallel stage with global window, dropping any events that > >> >> 'clearly > >> >> within limits' based on current time). There are probably other ways > to > >> >> do > >> >> this with timers within existing stage. > >> >> > >> >>> > >> >>> Pawel > >> >>> > >> >>> On 8 February 2018 at 05:40, Raghu Angadi <[email protected]> > wrote: > >> >>>> > >> >>>> The watermark is not directly available, you essentially infer from > >> >>>> fired triggers (e.g. fixed windows). I would consider some of these > >> >>>> options > >> >>>> : > >> >>>> - Adhoc debugging : if the pipeline is close to realtime, you can > >> >>>> just > >> >>>> guess if a element will be dropped based on its timestamp and > current > >> >>>> time > >> >>>> in the first stage (before first aggregation) > >> >>>> - Increase allowed lateness (say to 3 days) and drop the elements > >> >>>> yourself you notice are later than 1 day. > >> >>>> - Place the elements into another window with larger allowed > >> >>>> lateness > >> >>>> and log very late elements in another parallel aggregation (see > >> >>>> TriggerExample.java in Beam repo). > >> >>>> > >> >>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso < > [email protected]> > >> >>>> wrote: > >> >>>>> > >> >>>>> Hi everyone!! > >> >>>>> > >> >>>>> I have a streaming job running with fixed windows of one hour and > >> >>>>> allowed lateness of two days and the number of dropped due to > >> >>>>> lateness > >> >>>>> elements is slowly, but continuously growing and I'd like to > >> >>>>> understand > >> >>>>> which elements are those. > >> >>>>> > >> >>>>> I'd like to get the watermark from inside the job to compare it > >> >>>>> against > >> >>>>> each element and write log messages with the ones that will be > >> >>>>> potentially > >> >>>>> discarded.... Does that approach make any sense? I which case... > How > >> >>>>> can I > >> >>>>> get the watermark from inside the job? Any other ideas? > >> >>>>> > >> >>>>> Thanks in advance!! > >> >>>> > >> >>>> > >> >>> > >> > >
