Re: What is the right way to use the physical partitioning strategy in Data Streams?

2019-09-24 Thread Felipe Gutierrez
yes. It will be very welcome a discussion with who knows better than me. Basically, I am trying to implement the issue FLINK-1725 [1] that was gave up on March 2017. Stephan Ewen said that there are more issues to be fixed before going to this implementation and I don't really know which are

Re: What is the right way to use the physical partitioning strategy in Data Streams?

2019-09-23 Thread Biao Liu
Wow, that's really cool! There are indeed a lot works you have done. IMO it's beyond the scope of user group somewhat. Just one small concern, I'm not sure I have fully understood your way of "tackle data skew by altering the way Flink partition keys using KeyedStream". >From my understanding,

Re: What is the right way to use the physical partitioning strategy in Data Streams?

2019-09-23 Thread Felipe Gutierrez
I`ve implemented a combiner [1] in Flink by extending OneInputStreamOperator in Flink. I call my operator using "transform". It works well and I guess it is useful if I import this operator in the DataStream.java. I just need more to check if I need to touch other parts of the source code. But

Re: What is the right way to use the physical partitioning strategy in Data Streams?

2019-09-23 Thread Biao Liu
Hi Felipe, If I understand correctly, you want to solve data skew caused by imbalanced key? There is a common strategy to solve this kind of problem, pre-aggregation. Like combiner of MapReduce. But sadly, AFAIK Flink does not support pre-aggregation currently. I'm afraid you have to implement

Re: What is the right way to use the physical partitioning strategy in Data Streams?

2019-09-23 Thread Felipe Gutierrez
thanks Biao, I see. To achieve what I want to do I need to work with KeyedStream. I downloaded the Flink source code to learn and alter the KeyedStream to my needs. I am not sure but it is a lot of work because as far as I understood the key-groups have to be predictable [1]. and altering this

Re: What is the right way to use the physical partitioning strategy in Data Streams?

2019-09-23 Thread Biao Liu
Hi Felipe, Flink job graph is DAG based. It seems that you set an "edge property" (partitioner) several times. Flink does not support multiple partitioners on one edge. The later one overrides the priors. That means the "keyBy" overrides the "rebalance" and "partitionByPartial". You could insert