It sounds like you would like to have something like event-time-based
windowing, but with independent watermarking for every key. An approach
that can work, but it is somewhat cumbersome, is to not use watermarks or
windows, but instead put all of the logic in a KeyedProcessFunction (or
RichFlatMap). In this way you are free to implement your own policy for
deciding when a given "window" for a specific key (device) is ready for
processing, based solely on observing the events for that specific key.

Semantically I think this is similar to running a separate instance of the
job for each source, but with multi-tenancy, and with an impoverished API
(no watermarks, no event time timers, no event time windows).

Note that it is already the case that each parallel instance of an operator
has its own, independent notion of the current watermark. I believe your
problems arise from the fact that this current watermark is applied to all
events processed by that instance, regardless of their keys. I believe you
would like each key to maintain its own current watermark (event time
clock), so if one key (device) is idle, its watermark will wait for further
events to arrive. As it is now, events for other keys processed by the same
operator instance (or subtask) will advance the shared watermark, causing
an idle device's events to become late.

Regards,
David

On Fri, Jul 31, 2020 at 1:42 PM Sush Bankapura <
sushrutha.bankap...@man-es.com> wrote:

> Hi,
>
> We have a single Flink job that works on data from multiple data sources.
> These data sources are not aligned in time and also have intermittent
> connectivity lasting for days, due to which data will arrive late
>
> We attempted to use the event time and watermarks with parallel streams
> using keyby for the data source
>
> In case of parallel streams, for certain operators, the event time clock
> across all the subtasks  of the operator is the minimum value of the
> watermark among all its input streams.
>
> Reference:
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#watermarks
> in-parallel-streams
>
> While this seems to be a fundamental concept of Flink, are there any plans
> of having event  time clock per operator per subtask for such operators?
>
> This is causing us, not to use watermarks and to fallback on processing
> time semantics or in the worst case running the same Flink job for each and
> every different data source from which we are collecting data through Kafka
>
> Thanks,
> Sush
>

Reply via email to