Re: Kafka KeyedStream source

2017-01-18 Thread Fabian Hueske
Hi Niels, I was speaking more from a theoretical point of view. Flink does not have a hook to inject a custom hash function (yet). I'm not familiar enough with the details of the implementation to assess whether this would be possible or how much work it would be. However, several users have

Re: Kafka KeyedStream source

2017-01-16 Thread Fabian Hueske
Hi Niels, I think the biggest problem for keyed sources is that Flink must be able to co-locate key-partitioned state with the pre-partitioned data. This might work if the key is the partition ID, i.e., not the original key attribute that was hashed to assign events to partitions. Flink could

Re: Kafka KeyedStream source

2017-01-15 Thread Tzu-Li (Gordon) Tai
Hi Niels, If it's only for simple data filtering that does not depend on the key, a simple "flatMap" or "filter" directly after the source can be chained to the source instances. That way, the filter processing will be done within the same thread as the one fetching data from a
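The chaining idea described above can be sketched in plain Java: the filter predicate is applied inline in the same loop (and thus the same thread) that fetches records, so no data crosses a thread or network boundary before filtering. The names here are illustrative, not Flink's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified sketch of operator chaining: a filter chained to a source
// runs inline in the fetch loop instead of behind a shuffle.
public class ChainedFilter {
    public static List<String> fetchAndFilter(List<String> fetched,
                                              Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String record : fetched) {   // same thread as the "fetch"
            if (keep.test(record)) {
                out.add(record);          // emit downstream only if kept
            }
        }
        return out;
    }
}
```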

Re: Kafka KeyedStream source

2017-01-11 Thread Niels Basjes
Hi, Ok. I think I get it. WHAT IF: Assume we create an addKeyedSource(...) which would allow us to add a source that makes some guarantees about the data. And assume this source simply returns the Kafka partition id as the result of this 'hash' function. Then if I have 10 Kafka partitions I would
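The hypothetical addKeyedSource(...) proposed here does not exist in Flink; the idea is just that the 'hash' of a record would be its Kafka partition id, so all records of one partition form one keyed group and the pre-partitioned data never needs to be reshuffled. A minimal illustrative sketch:

```java
// Purely illustrative sketch of the proposal in this mail: the source
// reports the Kafka partition id itself as the record's "hash", instead
// of hashing the original key. This is NOT an existing Flink API.
public class PartitionIdKeying {
    // With N partitions, key groups are simply 0..N-1.
    public static int keyGroupOf(int kafkaPartitionId) {
        return kafkaPartitionId; // identity: the partition id is the hash
    }
}
```

With 10 partitions, records would map onto exactly 10 keyed groups, matching the source parallelism one-to-one.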

Re: Kafka KeyedStream source

2017-01-09 Thread Tzu-Li (Gordon) Tai
Hi Niels, Thank you for bringing this up. I recall some previous discussion related to this: [1]. I don't think this is possible at the moment, mainly because of how the API is designed. On the other hand, a KeyedStream in Flink is basically just a DataStream with a hash
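The "DataStream with a hash" point can be illustrated with a simplified sketch of keyed routing: the key is hashed, and the hash decides which parallel operator instance receives the record. Flink's real implementation uses a murmur hash and key groups; this plain-Java version only shows the principle, and the method names are made up for the example.

```java
// Simplified sketch of KeyedStream routing: same key -> same hash ->
// same parallel operator instance. Not Flink's actual hashing scheme.
public class KeyRouting {
    public static int assignInstance(String key, int parallelism) {
        // floorMod keeps the result non-negative even for negative hashCodes
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Because the assignment is deterministic, all events with the same key are processed by the same instance, which is what makes key-partitioned state work.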

Kafka KeyedStream source

2017-01-05 Thread Niels Basjes
Hi, In my scenario I have click stream data that I persist in Kafka. I use the sessionId as the key to instruct Kafka to put everything with the same sessionId into the same Kafka partition. That way I already have all events of a visitor in a single Kafka partition in a fixed order. When I read
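The keyed-producer setup described here relies on Kafka's key-based partitioning: records with equal keys land in the same partition and therefore stay in order relative to each other. A simplified stand-in for that mapping (Kafka's real default partitioner uses murmur2 over the serialized key bytes, not String.hashCode) looks like:

```java
// Simplified illustration of Kafka's keyed partitioning: events with the
// same sessionId always map to the same partition. Kafka actually uses
// murmur2 over the key bytes; hashCode is used here only for brevity.
public class SessionPartitioner {
    public static int partitionFor(String sessionId, int numPartitions) {
        return Math.floorMod(sessionId.hashCode(), numPartitions);
    }
}
```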