Re: In Flink, is there a way to merge two streams in stateful manner

Hang Ruan Thu, 10 Aug 2023 19:15:22 -0700

ps: Forget the link: Hybrid Source[1]

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/datastream/hybridsource/


Hang Ruan <[email protected]> 于2023年8月11日周五 10:14写道：

> Hi, Muazim.
>
> I think the Hybird Source[1] may be helpful for your case.
>
> Best,
> Hang
>
> Ken Krugler <[email protected]> 于2023年8月11日周五 04:18写道：
>
>> As (almost) always, the devil is in the details.
>>
>> You haven’t said, but I’m assuming you’re writing out multiple files,
>> each with a different schema, as otherwise you could just leverage the
>> existing Flink support for CSV.
>>
>> So then you could combine the header/footer streams (adding a flag for
>> header vs. footer), and connect that to the row data stream, then use a
>> KeyedCoProcessFunction (I’m assuming you can key by something that
>> identifies which schema). You’d buffer the row data & footer (separately in
>> state). You would also need to set up a timer to fire at the max watermark,
>> to flush out pending rows/footer when all of the input data has been
>> processed.
>>
>> After that function you could configure the sink to bucket by the target
>> schema.
>>
>> — Ken
>>
>>
>> On Aug 10, 2023, at 10:41 AM, Muazim Wani <[email protected]> wrote:
>>
>> Thanks for the response!
>> I have a specific use case where I am writing to a TextFile sink. I have
>> a Bounded stream of header data and need  to merge it with another bounded
>> stream. While writing the data to a text file the header data should be
>> written before the original data(from another bounded stream). And also at
>> last I have another stream of footers where I would repeat the same process.
>> I tried keeping an identifier for all three streams and based on these
>> identifiers I added the data in three different ListState
>> using KeyedProcess functions. So for headers I directly emitted the value
>> but for main data and footers if I store it in a state . The issue is
>> Outside KeyedProcess I am not able to emit the data in order.
>> Is there any way I can achieve the use case of orderding the dataStreams
>> . I also tried with union but it seems it adds data arbitrarily in both
>> streams .
>> Thanks and regards
>>
>> On Thu, 10 Aug, 2023, 10:59 pm Ken Krugler, <[email protected]>
>> wrote:
>>
>>> Hi Muazim,
>>>
>>> In Flink, a stream of data (unless bounded) is assumed to never end.
>>>
>>> So in your example below, this means stream 2 would NEVER be emitted,
>>> because stream 1 would never end (there is no time at which you know for
>>> sure that stream 1 is done).
>>>
>>> And this in turn means stream 2 would be buffered forever in state, thus
>>> growing unbounded.
>>>
>>> I would suggest elaborating on your use case.
>>>
>>> Regards,
>>>
>>> — Ken
>>>
>>>
>>> On Aug 10, 2023, at 10:11 AM, Muazim Wani <[email protected]> wrote:
>>>
>>> Hi Team,
>>> I have a use case where I have two streams and want to join them in
>>> stateful manner .
>>> E.g data of stream 1 should be emitted before stream2.
>>> I tried to store the data in ListState in KeyedProcessFunction but I am
>>> not able to access state  outside proccessElement().
>>> Is there any way I could achieve this?
>>> Thanks and regards
>>>
>>>
>>> --------------------------
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> Custom big data solutions
>>> Flink, Pinot, Solr, Elasticsearch
>>>
>>>
>>>
>>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> Custom big data solutions
>> Flink, Pinot, Solr, Elasticsearch
>>
>>
>>
>>

Re: In Flink, is there a way to merge two streams in stateful manner

Reply via email to