I thought of loading all of it into the PubSub subscription before starting
the job. That should work, right? Any better suggestion?
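
In case it's useful, here is a rough sketch of what I mean (the bucket
path, topic name, "ts" attribute and CSV parsing are all made up; it
assumes the streaming job reads the subscription with the matching
timestamp attribute):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.joda.time.Instant;

public class BackfillToPubsub {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("gs://my-bucket/old-data/*"))  // hypothetical location
        // Stamp each record with its historical event time.
        .apply(WithTimestamps.of(
            (String line) -> new Instant(extractEventTimeMillis(line))))
        // Copy the element timestamp into a "ts" attribute on each
        // message, so the streaming job can read it back as event time.
        .apply(PubsubIO.writeStrings()
            .to("projects/my-project/topics/backfill")  // hypothetical topic
            .withTimestampAttribute("ts"));
    p.run().waitUntilFinish();
  }

  // Hypothetical parsing: assumes the event time is the first CSV field,
  // in epoch millis.
  private static long extractEventTimeMillis(String line) {
    return Long.parseLong(line.split(",", 2)[0]);
  }
}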
On Thu, 8 Feb 2018 at 22:23, Carlos Alonso <car...@mrcalonso.com> wrote:

> Yes, the data is finite (although it comes through PubSub, so I guess it
> is considered unbounded).
> How could I hold the watermark and prevent it from moving?
>
> Thanks!
>
> On Thu, Feb 8, 2018 at 10:06 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> Where is the watermark for this old data coming from? Rather than
>> messing with allowed lateness, would it be possible to hold the
>> watermark back appropriately during the time you're injecting old data
>> (assuming there's only a finite amount of it)?
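>>
>> (For the record, a minimal sketch of what I mean, assuming the
>> publisher sets a "ts" attribute on each message; the attribute name and
>> subscription are made up. As far as I know, with a timestamp attribute
>> PubsubIO estimates the watermark from the oldest in-flight message, so
>> injecting old data effectively holds it back.)
>>
>> import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
>>
>> // Event timestamps come from the "ts" attribute instead of publish
>> // time, and the source's watermark tracks those timestamps.
>> p.apply(PubsubIO.readStrings()
>>     .fromSubscription(
>>         "projects/my-project/subscriptions/backfill-sub")
>>     .withTimestampAttribute("ts"));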
>>
>> On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <car...@mrcalonso.com>
>> wrote:
>> > Thanks for your responses!!
>> >
>> > I have a scenario where I have to reprocess very disordered data
>> > spanning 4 or 5 years, and I don't want to lose any of it. I'm
>> > thinking of setting a very big allowed lateness (5 years), but before
>> > doing that I'd like to understand the consequences it may have. I
>> > guess it will be very memory-consuming, as no window will ever
>> > expire, but I could probably overcome that with brute force (many
>> > machines with lots of RAM). Are there more concerns I should be aware
>> > of? This should be a one-off thing.
>> >
>> > Thanks!
>> >
>> > On Thu, Feb 8, 2018 at 6:59 PM Raghu Angadi <rang...@google.com> wrote:
>> >>
>> >> On Wed, Feb 7, 2018 at 10:33 PM, Pawel Bartoszek
>> >> <pawelbartosze...@gmail.com> wrote:
>> >>>
>> >>> Hi Raghu,
>> >>> Can you provide more details about increasing allowed lateness?
>> >>> Even if I do that, don't I still need to compare the event time of
>> >>> each record with the processing time (current system time) in my
>> >>> ParDo?
>> >>
>> >>
>> >> I see. The PaneInfo associated with each element has a 'Timing'
>> >> enum, so we can tell that an element is late, but not how late.
>> >> How about this: we can have a periodic timer firing every minute and
>> >> store the scheduled time of the timer in state as the watermark
>> >> time. We could compare each element's time to this stored time for a
>> >> good approximation (it may require a parallel stage with a global
>> >> window, dropping any events that are clearly within limits based on
>> >> the current time). There are probably other ways to do this with
>> >> timers within the existing stage.
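>> >>
>> >> A rough, untested sketch of that timer idea (it assumes a keyed
>> >> input in the global window; the String key/value types, the class
>> >> name and the two-day lateness are placeholders):
>> >>
>> >> import org.apache.beam.sdk.state.StateSpec;
>> >> import org.apache.beam.sdk.state.StateSpecs;
>> >> import org.apache.beam.sdk.state.TimeDomain;
>> >> import org.apache.beam.sdk.state.Timer;
>> >> import org.apache.beam.sdk.state.TimerSpec;
>> >> import org.apache.beam.sdk.state.TimerSpecs;
>> >> import org.apache.beam.sdk.state.ValueState;
>> >> import org.apache.beam.sdk.transforms.DoFn;
>> >> import org.apache.beam.sdk.values.KV;
>> >> import org.joda.time.Duration;
>> >> import org.joda.time.Instant;
>> >> import org.slf4j.Logger;
>> >> import org.slf4j.LoggerFactory;
>> >>
>> >> class ApproxWatermarkFn extends DoFn<KV<String, String>, KV<String, String>> {
>> >>   private static final Logger LOG =
>> >>       LoggerFactory.getLogger(ApproxWatermarkFn.class);
>> >>
>> >>   // Last timer firing time, kept as a lower bound on the watermark.
>> >>   @StateId("wm")
>> >>   private final StateSpec<ValueState<Instant>> wmSpec = StateSpecs.value();
>> >>
>> >>   // Event-time timer that fires roughly every minute of event time.
>> >>   @TimerId("tick")
>> >>   private final TimerSpec tickSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);
>> >>
>> >>   @ProcessElement
>> >>   public void process(ProcessContext c,
>> >>       @StateId("wm") ValueState<Instant> wm,
>> >>       @TimerId("tick") Timer tick) {
>> >>     // Arm (or re-arm) the periodic timer relative to the watermark.
>> >>     tick.offset(Duration.standardMinutes(1)).setRelative();
>> >>     Instant approx = wm.read();
>> >>     if (approx != null && c.timestamp()
>> >>         .isBefore(approx.minus(Duration.standardDays(2)))) {
>> >>       LOG.warn("Older than approx watermark minus allowed lateness: {}",
>> >>           c.element());
>> >>     }
>> >>     c.output(c.element());
>> >>   }
>> >>
>> >>   @OnTimer("tick")
>> >>   public void onTick(OnTimerContext c,
>> >>       @StateId("wm") ValueState<Instant> wm,
>> >>       @TimerId("tick") Timer tick) {
>> >>     // The watermark has passed the firing time; record it and re-arm.
>> >>     wm.write(c.timestamp());
>> >>     tick.offset(Duration.standardMinutes(1)).setRelative();
>> >>   }
>> >> }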
>> >>
>> >>>
>> >>> Pawel
>> >>>
>> >>> On 8 February 2018 at 05:40, Raghu Angadi <rang...@google.com> wrote:
>> >>>>
>> >>>> The watermark is not directly available; you essentially infer it
>> >>>> from fired triggers (e.g. fixed windows). I would consider some of
>> >>>> these options:
>> >>>>   - Ad-hoc debugging: if the pipeline is close to realtime, you can
>> >>>> just guess whether an element will be dropped based on its
>> >>>> timestamp and the current time in the first stage (before the first
>> >>>> aggregation); a rough sketch of this follows below.
>> >>>>   - Increase allowed lateness (say to 3 days) and drop the elements
>> >>>> you notice are more than 1 day late yourself.
>> >>>>   - Place the elements into another window with larger allowed
>> >>>> lateness and log very late elements in another parallel aggregation
>> >>>> (see TriggerExample.java in the Beam repo).
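>> >>>>
>> >>>> Sketch of the first option (untested; two days stands in for your
>> >>>> allowed lateness, and wall-clock time is only an upper bound on the
>> >>>> watermark, so treat the logging as an approximation):
>> >>>>
>> >>>> import org.apache.beam.sdk.transforms.DoFn;
>> >>>> import org.joda.time.Duration;
>> >>>> import org.joda.time.Instant;
>> >>>> import org.slf4j.Logger;
>> >>>> import org.slf4j.LoggerFactory;
>> >>>>
>> >>>> // Passes every element through, logging those whose event time is
>> >>>> // already further behind wall-clock time than the allowed lateness.
>> >>>> class LogLikelyDroppedFn extends DoFn<String, String> {
>> >>>>   private static final Logger LOG =
>> >>>>       LoggerFactory.getLogger(LogLikelyDroppedFn.class);
>> >>>>   private static final Duration ALLOWED_LATENESS =
>> >>>>       Duration.standardDays(2);
>> >>>>
>> >>>>   @ProcessElement
>> >>>>   public void process(ProcessContext c) {
>> >>>>     if (c.timestamp().isBefore(Instant.now().minus(ALLOWED_LATENESS))) {
>> >>>>       LOG.warn("Likely dropped as late: {} (ts={})",
>> >>>>           c.element(), c.timestamp());
>> >>>>     }
>> >>>>     c.output(c.element());
>> >>>>   }
>> >>>> }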
>> >>>>
>> >>>> On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <car...@mrcalonso.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Hi everyone!!
>> >>>>>
>> >>>>> I have a streaming job running with fixed windows of one hour and
>> >>>>> an allowed lateness of two days, and the number of elements
>> >>>>> dropped due to lateness is slowly but continuously growing. I'd
>> >>>>> like to understand which elements those are.
>> >>>>>
>> >>>>> I'd like to get the watermark from inside the job, compare it
>> >>>>> against each element, and write log messages for the ones that
>> >>>>> will potentially be discarded... Does that approach make any
>> >>>>> sense? In which case, how can I get the watermark from inside the
>> >>>>> job? Any other ideas?
>> >>>>>
>> >>>>> Thanks in advance!!
>> >>>>
>> >>>>
>> >>>
>> >
>>
>
