There are a few ways to pre-ingest data from a side input before beginning
to process another stream. One is to use the State Processor API [1] to
create a savepoint that has the data from that side input in its state. For
a simple example of bootstrapping state into a savepoint, see [2].
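To make that concrete, here is a minimal sketch of writing reference data into a savepoint with the State Processor API (the Flink 1.10-era, DataSet-based flavor). The Model type, the "model" state name, the "validator-uid" operator uid, and the paths are all illustrative assumptions, not the actual contents of [2]:

```java
// Sketch only: bootstrap keyed state into a savepoint, then start the
// streaming job from that savepoint. Model, "validator-uid", and the
// savepoint path are assumptions for illustration.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class ModelBootstrapper extends KeyedStateBootstrapFunction<String, Model> {
    private transient ValueState<Model> modelState;

    @Override
    public void open(Configuration parameters) {
        modelState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("model", Model.class));
    }

    @Override
    public void processElement(Model value, Context ctx) throws Exception {
        modelState.update(value);  // becomes keyed state in the savepoint
    }
}

// Elsewhere, in a batch job:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Model> modelData = ...;  // read the side-input data

BootstrapTransformation<Model> transformation = OperatorTransformation
    .bootstrapWith(modelData)
    .keyBy(m -> m.key)
    .transform(new ModelBootstrapper());

Savepoint
    .create(new MemoryStateBackend(), 128)  // 128 = max parallelism
    .withOperator("validator-uid", transformation)
    .write("hdfs://path/to/savepoint");
```

The uid passed to withOperator must match the uid() set on the stateful operator in the streaming job that will later be started from this savepoint.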

Another approach is to buffer the stream to be validated in Flink state
until the side input has been fully ingested. Alternatively, run the job
once with no event traffic, take a savepoint once the model has been
broadcast, and then use that savepoint to start the real job.
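The buffering variant might look roughly like this in a KeyedBroadcastProcessFunction, where the model arrives on the broadcast side and events are parked in keyed ListState until it does. Event, Model, and validate() are assumptions for illustration:

```java
// Sketch: buffer events per key until the broadcast side input is present,
// then flush. Event, Model, and validate() are illustrative assumptions.
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BufferingValidator
        extends KeyedBroadcastProcessFunction<String, Event, Model, Event> {

    static final MapStateDescriptor<String, Model> MODEL =
        new MapStateDescriptor<>("model", Types.STRING, Types.POJO(Model.class));

    private transient ListState<Event> buffered;

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getListState(
            new ListStateDescriptor<>("buffered-events", Event.class));
    }

    @Override
    public void processElement(Event event, ReadOnlyContext ctx, Collector<Event> out)
            throws Exception {
        Model model = ctx.getBroadcastState(MODEL).get("model");
        if (model == null) {
            buffered.add(event);          // side input not fully ingested yet
            return;
        }
        for (Event e : buffered.get()) {  // flush anything parked for this key
            validate(e, model, out);
        }
        buffered.clear();
        validate(event, model, out);
    }

    @Override
    public void processBroadcastElement(Model model, Context ctx, Collector<Event> out)
            throws Exception {
        // Ideally only store this once the model is known to be complete.
        ctx.getBroadcastState(MODEL).put("model", model);
    }

    private void validate(Event e, Model m, Collector<Event> out) { ... }
}
```

One caveat: since the Kafka source for the side input is unbounded, something has to signal that it has been "fully ingested" (for example, a marker record at the end of the model data).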

Yet another solution might be to use a custom source that reads from one
topic and then the other. See [3] and [4] for an example of that.
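The core of that idea (as in [3] and [4]) is a wrapper source that runs a bounded side-input source to completion before starting the main one. This is a sketch against the pre-FLIP-27 SourceFunction interface; the wrapped sources are constructor assumptions:

```java
// Sketch: drain the side-input source, then run the main source.
// Assumes the side-input source is bounded, i.e. its run() returns.
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class SequentialSource<T> implements SourceFunction<T> {

    private final SourceFunction<T> sideInput;   // must be bounded
    private final SourceFunction<T> mainStream;
    private volatile boolean running = true;

    public SequentialSource(SourceFunction<T> sideInput, SourceFunction<T> mainStream) {
        this.sideInput = sideInput;
        this.mainStream = mainStream;
    }

    @Override
    public void run(SourceContext<T> ctx) throws Exception {
        sideInput.run(ctx);       // blocks until the side input is exhausted
        if (running) {
            mainStream.run(ctx);  // only then start the main stream
        }
    }

    @Override
    public void cancel() {
        running = false;
        sideInput.cancel();
        mainStream.cancel();
    }
}
```

In practice the tricky part is making the side-input source bounded, e.g. stopping at the offsets observed when the job starts, which is what [4] deals with.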

Other references on this topic include FLIP-17 [5] and Gregory Fee's talk
on bootstrapping state [6].

Regards,
David

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html
[2] https://gist.github.com/alpinegizmo/ff3d2e748287853c88f21259830b29cf
[3] https://stackoverflow.com/a/48711260/2000823
[4] https://github.com/ScaleUnlimited/flink-streaming-kmeans/blob/master/src/main/java/com/scaleunlimited/flinksources/UnionedSources.java
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[6] https://www.youtube.com/watch?v=WdMcyN5QZZQ

On Fri, Apr 3, 2020 at 9:08 PM Georgi Stoyanov <gstoya...@live.com> wrote:
>
> Hi,
>
> I want to implement a flow where the data from one stream is needed to
> validate data for a second stream when the job is started without a
> savepoint or checkpoint.
>
> Both of them are read from Kafka. I want the data in the first one to
> be fully read before checking the events from the second stream.
>
> Do you have any suggestions for how this could be achieved (maybe
> without a window, or by just putting events in state and starting a timer)?
>
>
> Kind Regards
>
> G.S.
