Re: Dynamically merge multiple upstream sources

2020-06-05 Thread Arvid Heise
Hi Yi,

one option is to use Avro, where you define one global Avro schema as
the source of truth. Then you add aliases [1] to this schema for each
source where the fields are named differently. You use the same schema
to read the Avro messages from Kafka, and Avro automatically converts
the data from its writer schema (stored in some schema registry) into
your global schema.

If you pay close attention to Avro's schema evolution rules [2], you
can even add and remove fields in the different sources without
changing anything programmatically in your Flink ingestion job.

[1] https://avro.apache.org/docs/1.8.1/spec.html#Aliases
[2] https://avro.apache.org/docs/1.8.1/spec.html#Schema+Resolution
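
To make this concrete, a minimal sketch of such an ingestion job could
look as follows (topic names, registry URL, and the schema itself are
placeholders; it assumes a Confluent schema registry and the
flink-avro-confluent-registry dependency):

import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class GlobalSchemaIngestion {

    public static void main(String[] args) throws Exception {
        // Global reader schema: "userId" is the canonical field name; the
        // alias "uid" resolves sources whose writer schema used that name.
        // The default on "amount" keeps writer schemas that lack the field
        // readable (schema evolution).
        Schema globalSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                        + "{\"name\":\"userId\",\"type\":\"string\",\"aliases\":[\"uid\"]},"
                        + "{\"name\":\"amount\",\"type\":\"double\",\"default\":0.0}]}");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Every topic is read with the same global reader schema; Avro's
        // schema resolution translates each writer schema on the fly.
        DataStream<GenericRecord> a = env.addSource(new FlinkKafkaConsumer<>(
                "topic-a",
                ConfluentRegistryAvroDeserializationSchema.forGeneric(
                        globalSchema, "http://registry:8081"),
                props));
        DataStream<GenericRecord> b = env.addSource(new FlinkKafkaConsumer<>(
                "topic-b",
                ConfluentRegistryAvroDeserializationSchema.forGeneric(
                        globalSchema, "http://registry:8081"),
                props));

        a.union(b).print();
        env.execute("merge-upstream-sources");
    }
}

Onboarding another source whose writer schema uses yet another field
name then only needs a new alias in the global schema, not a code change.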

On Tue, Jun 2, 2020 at 8:35 AM Yi wrote:

> Hi all,
>
> I have a use case where I want to merge several upstream data sources
> (Kafka topics). The data are essentially the same, but the field names
> differ.
>
> I guess my problem is not so much about Flink itself. It is about how
> to deserialize and merge data from the different sources effectively
> with Flink.
> I can define different schemas, deserialize the data, and merge them
> manually. I wonder if there is a dynamic way to do this, that is, I
> want changing field names to work like renaming pandas DataFrame
> columns. I see there is already
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
> but resorting to pandas implies working with Python, which is something
> I prefer not to do.
>
> What is your practice on dynamically changing sources and merging them?
> I'd love to hear your opinion.
>
> Bests,
> Yi
>


-- 

Arvid Heise | Senior Java Developer



Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany


Dynamically merge multiple upstream sources

2020-06-02 Thread uuuuuu
Hi all,

I have a use case where I want to merge several upstream data sources
(Kafka topics). The data are essentially the same, but the field names
differ.

I guess my problem is not so much about Flink itself. It is about how
to deserialize and merge data from the different sources effectively
with Flink.
I can define different schemas, deserialize the data, and merge them
manually. I wonder if there is a dynamic way to do this, that is, I
want changing field names to work like renaming pandas DataFrame
columns. I see there is already
https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
but resorting to pandas implies working with Python, which is something
I prefer not to do.
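
For concreteness, the manual approach I have in mind looks roughly like
this (all class, field, and stream names are made up):

import org.apache.flink.streaming.api.datastream.DataStream;

// Rough sketch of the manual approach: one POJO per topic, each mapped
// onto a common type, then merged with union().
public class ManualMergeSketch {

    // Common target type that every source is mapped onto.
    public static class CommonEvent {
        public String userId;
        public double amount;

        public CommonEvent() {}

        public CommonEvent(String userId, double amount) {
            this.userId = userId;
            this.amount = amount;
        }
    }

    // Per-source types with the diverging field names.
    public static class SourceAEvent { public String uid; public double amount; }
    public static class SourceBEvent { public String userId; public double total; }

    public static DataStream<CommonEvent> merge(
            DataStream<SourceAEvent> a, DataStream<SourceBEvent> b) {
        // returns(...) gives Flink the output type, since lambdas erase it.
        return a.map(e -> new CommonEvent(e.uid, e.amount))
                .returns(CommonEvent.class)
                .union(b.map(e -> new CommonEvent(e.userId, e.total))
                        .returns(CommonEvent.class));
    }
}

It works, but every new source means a new class and a new map function,
which is exactly the boilerplate I am hoping to avoid.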

What is your practice on dynamically changing sources and merging them?
I'd love to hear your opinion.

Bests,
Yi