Hi Darshan,

In your use case, I think you can implement the outer join with the DataStream API (State + ProcessFunction + Timer). With a suitable state type, such as ValueState, you can store just one value per key and do not need to keep the full history of values for every key. A timer can then clear a key's state once it stops receiving updates.
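To make the idea concrete, here is a rough sketch of the "one value per key" logic in plain Java. It is only a model of the approach and does not use the Flink API: the class and method names (LatestValueOuterJoin, onLeft, onRight, expire) are made up for illustration, the HashMap stands in for keyed state, and expire() stands in for what a timer callback would do. Values are Strings just to keep it short.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Plain-Java model of a latest-value full outer join. Hypothetical names, not Flink API. */
final class LatestValueOuterJoin {

    /** Per-key state: one slot per side plus the time of the last update. */
    private static final class Slot {
        String left;      // latest value from the left stream, null if none seen yet
        String right;     // latest value from the right stream, null if none seen yet
        long lastUpdate;  // timestamp of the most recent update on either side
    }

    private final Map<String, Slot> state = new HashMap<>();
    private final long ttlMillis;

    LatestValueOuterJoin(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** New left value: overwrite the left slot and emit the joined row. */
    String onLeft(String key, String value, long now) {
        Slot s = state.computeIfAbsent(key, k -> new Slot());
        s.left = value;
        s.lastUpdate = now;
        return format(key, s);
    }

    /** New right value: overwrite the right slot and emit the joined row. */
    String onRight(String key, String value, long now) {
        Slot s = state.computeIfAbsent(key, k -> new Slot());
        s.right = value;
        s.lastUpdate = now;
        return format(key, s);
    }

    /** Stand-in for a timer callback: drop keys idle for longer than the TTL. */
    int expire(long now) {
        int removed = 0;
        Iterator<Slot> it = state.values().iterator();
        while (it.hasNext()) {
            if (now - it.next().lastUpdate > ttlMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** Number of keys currently held in state. */
    int size() {
        return state.size();
    }

    /** A missing side is rendered as "null", as in an outer join. */
    private static String format(String key, Slot s) {
        return key + "," + s.left + "," + s.right;
    }

    public static void main(String[] args) {
        LatestValueOuterJoin join = new LatestValueOuterJoin(100L);
        System.out.println(join.onLeft("k", "a", 0L));   // right side not seen yet
        System.out.println(join.onRight("k", "x", 10L)); // both sides present
        System.out.println(join.onLeft("k", "b", 20L));  // left value replaced, still one entry
        System.out.println(join.expire(121L));           // key idle for more than the TTL
    }
}
```

In a real job the Slot fields would live in keyed state inside a two-input process function, and the TTL would be set to cover the 50-60 days during which a key can still receive updates. The state then grows with the number of live keys, not with the number of updates.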
You can also refer to Flink's implementation of the DataStream join [1].

[1]: https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/datastream/DataStreamJoin.scala#L223

Thanks,
vino.

2018-07-24 1:28 GMT+08:00 Darshan Singh <[email protected]>:

> Hi
>
> I was looking at the new full outer join. This seems to be working fine
> for my use case, however I have a question regarding the state size.
>
> I have 2 streams, each of which will have hundreds of millions of unique
> keys. Also, each of these will get updated values for its keys hundreds
> of times per day.
>
> As per my understanding, in a full outer join Flink will keep in state
> all the values of the keys which it has seen, and whenever a new value
> comes from one of the streams, it will be joined against all of the
> values which were there for that key on the 2nd stream. It could be 1 or
> hundreds of rows. This seems inefficient, but my question is more on the
> state side. Thus, I will need to keep billions of values in state on
> both sides. This will be very expensive.
>
> It is a non-windowed join. A key can receive updates for 50-60 days and
> after that it won't get any updates on any of the streams.
>
> Is there a way we could use state such that only 1 value per key is
> retained, to reduce the size of the state?
>
> I am using the Table API but could use the DataStream API if needed.
>
> Thanks
>
