Hi Darshan,

In your use case, I think you can implement the outer join with DataStream
API ( use State + ProcessFunction + Timer ). Using suitable statue, you can
store 1 value per key and do not need to keep all the value's history for
every key.

And you can refer to Flink's implementation of DataStream join[1].

[1]:
https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/datastream/DataStreamJoin.scala#L223

Thanks, vino.

2018-07-24 1:28 GMT+08:00 Darshan Singh <[email protected]>:

> Hi
>
> I was looking at the new full outer join. This seems to be working fine
> for my use case however I have a question regarding the state size.
>
> I have 2 streams each will have 100's of million unique keys. Also, Each
> of these will get the updated value of keys 100's of times per day.
>
> As per my understanding in full outer join flink will keep all the values
> of the keys which it has seen in the state and whenever a new value comes
> from
> 1 of the stream. It will be joined against all of the key values which
> were there for 2nd stream.It could be 1 or 100's of rows. This seems
> inefficient
> but my question is more on the state side. Thus, I will need to keep
> billion's of values in state on both side. This will be very expensive.
>
> It is a non windowed join. A key can recieve updates for 50-60 days and
> after that it wont get any updates on any of the streams.
>
> Is there a way we could use a state such that only 1 value per key is
> retained in the state to reduce the size of the state?
>
> I am using the Table API but could use the Datastream api if needed.
>
> Thanks
>

Reply via email to