What are the advantages of holding back watermark over allowed lateness
(duration of both being approximately same)?

On Thu, Feb 8, 2018 at 1:38 PM, Robert Bradshaw <[email protected]> wrote:

> You can set the timestamp attribute of your pubsub messages which will
> hold back the watermark, see
>
> https://beam.apache.org/documentation/sdks/javadoc/2.
> 2.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#
> withTimestampAttribute-java.lang.String-
>
> However, if you're mixing historical and live data, it may make more
> sense to read these as two separate sources (e.g. the static data from
> a set of files, the live data from pubsub) and then flatten them for
> further processing.
>
> On Thu, Feb 8, 2018 at 1:23 PM, Carlos Alonso <[email protected]>
> wrote:
> > Yes, the data is finite (although it comes through PubSub, so I guess is
> > considered unbounded).
> > How could I hold the watermark and prevent it from moving?
> >
> > Thanks!
> >
> > On Thu, Feb 8, 2018 at 10:06 PM Robert Bradshaw <[email protected]>
> wrote:
> >>
> >> Where is the watermark for this old data coming from? Rather than
> >> messing with allowed lateness, would it be possible to hold the
> >> watermark back appropriately during the time you're injecting old data
> >> (assuming there's only a finite amount of it)?
> >>
> >> On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <[email protected]>
> >> wrote:
> >> > Thanks for your responses!!
> >> >
> >> > I have a scenario where I have to reprocess very disordered data for 4
> >> > or 5
> >> > years and I don't want to lose any data. I'm thinking of setting a
> very
> >> > big
> >> > allowed lateness (5 years), but before doing that I'd like to
> understand
> >> > the
> >> > consequences that may have. I guess memory wise will be very consuming
> >> > as no
> >> > window will ever expire, but I guess I could overcome that with brute
> >> > force
> >> > (many machines with many RAM) but, are there more concerns I should be
> >> > aware
> >> > of? This should be a one-off thing.
> >> >
> >> > Thanks!
> >> >
> >> > On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <[email protected]>
> wrote:
> >> >>
> >> >> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek
> >> >> <[email protected]> wrote:
> >> >>>
> >> >>> Hi Raghu,
> >> >>> Can you provide more details about increasing allowed lateness? Even
> >> >>> if I
> >> >>> do that I still need to compare event time of record with processing
> >> >>> time(system current time) in my ParDo?
> >> >>
> >> >>
> >> >> I see. PaneInfo() associated with each element has 'Timing' enum, so
> we
> >> >> can tell if the element is late, but it does not tell how late.
> >> >> How about this : We can have a periodic timer firing every minute and
> >> >> store the scheduled time of the timer in state as the watermark time.
> >> >> We
> >> >> could compare element time to this stored time for good approximation
> >> >> (may
> >> >> require parallel stage with global window, dropping any events that
> >> >> 'clearly
> >> >> within limits' based on current time). There are probably other ways
> to
> >> >> do
> >> >> this with timers within existing stage.
> >> >>
> >> >>>
> >> >>> Pawel
> >> >>>
> >> >>> On 8 February 2018 at 05:40, Raghu Angadi <[email protected]>
> wrote:
> >> >>>>
> >> >>>> The watermark is not directly available, you essentially infer from
> >> >>>> fired triggers (e.g. fixed windows). I would consider some of these
> >> >>>> options
> >> >>>> :
> >> >>>>   - Adhoc debugging : if the pipeline is close to realtime, you can
> >> >>>> just
> >> >>>> guess if a element will be dropped based on its timestamp and
> current
> >> >>>> time
> >> >>>> in the first stage (before first aggregation)
> >> >>>>   - Increase allowed lateness (say to 3 days) and drop the elements
> >> >>>> yourself you notice are later than 1 day.
> >> >>>>   - Place the elements into another window with larger allowed
> >> >>>> lateness
> >> >>>> and log very late elements in another parallel aggregation (see
> >> >>>> TriggerExample.java in Beam repo).
> >> >>>>
> >> >>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <
> [email protected]>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Hi everyone!!
> >> >>>>>
> >> >>>>> I have a streaming job running with fixed windows of one hour and
> >> >>>>> allowed lateness of two days and the number of dropped due to
> >> >>>>> lateness
> >> >>>>> elements is slowly, but continuously growing and I'd like to
> >> >>>>> understand
> >> >>>>> which elements are those.
> >> >>>>>
> >> >>>>> I'd like to get the watermark from inside the job to compare it
> >> >>>>> against
> >> >>>>> each element and write log messages with the ones that will be
> >> >>>>> potentially
> >> >>>>> discarded.... Does that approach make any sense? I which case...
> How
> >> >>>>> can I
> >> >>>>> get the watermark from inside the job? Any other ideas?
> >> >>>>>
> >> >>>>> Thanks in advance!!
> >> >>>>
> >> >>>>
> >> >>>
> >> >
>

Reply via email to