Re: Deduplicate messages from Kafka topic

2017-01-15 Thread Tzu-Li (Gordon) Tai
Hi,

You’re correct that the FlinkKafkaProducer may emit duplicates to Kafka topics, 
as it currently only provides at-least-once guarantees.
Note that this isn’t a restriction only in the FlinkKafkaProducer, but a 
general restriction for Kafka's message delivery.
This can definitely be improved to exactly-once (no duplicates produced into 
topics) once Kafka supports transactional messaging.

On the consumer side, the FlinkKafkaConsumer doesn’t have built-in support for 
deduplicating the messages read from topics.
That isn’t really feasible in general, as a consumer can only treat messages at 
different offsets as separate, independent messages, unless duplicates are 
identified by some user application-level logic.
So in the end, we’ll need to rely on messages produced into Kafka topics not 
being duplicated, which, as explained above, will hopefully be possible in the 
near future.
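
To illustrate what such application-level logic could look like, here is a 
minimal sketch of the per-producer "high watermark" idea from the question. 
It is plain Java and not part of Flink or Kafka. The names SequencedMessage, 
HighWatermarkDeduplicator, producerId and sequence are hypothetical, and the 
sketch assumes that each producer stamps its records with a strictly 
increasing sequence number and that records from one producer are read in 
order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical envelope: the producer adds its own ID and a sequence number. */
final class SequencedMessage {
    final String producerId;
    final long sequence;
    final String payload;

    SequencedMessage(String producerId, long sequence, String payload) {
        this.producerId = producerId;
        this.sequence = sequence;
        this.payload = payload;
    }
}

/**
 * Remembers the highest sequence number seen per producer and rejects
 * anything at or below it. Assumes in-order delivery per producer.
 */
final class HighWatermarkDeduplicator {
    private final Map<String, Long> highWatermarks = new HashMap<>();

    /** Returns true if the message is new, false if it is a duplicate. */
    boolean accept(SequencedMessage message) {
        Long seen = highWatermarks.get(message.producerId);
        if (seen != null && message.sequence <= seen) {
            return false; // already processed: a re-sent record
        }
        highWatermarks.put(message.producerId, message.sequence);
        return true;
    }

    /** Convenience helper: keeps only the payloads of accepted messages. */
    List<String> filterPayloads(List<SequencedMessage> messages) {
        List<String> accepted = new ArrayList<>();
        for (SequencedMessage message : messages) {
            if (accept(message)) {
                accepted.add(message.payload);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<SequencedMessage> input = new ArrayList<>();
        input.add(new SequencedMessage("p1", 1, "a"));
        input.add(new SequencedMessage("p1", 2, "b"));
        input.add(new SequencedMessage("p1", 2, "b")); // duplicate of the previous record
        input.add(new SequencedMessage("p2", 1, "c")); // different producer, own watermark
        input.add(new SequencedMessage("p1", 1, "a")); // late duplicate
        System.out.println(new HighWatermarkDeduplicator().filterPayloads(input)); // prints [a, b, c]
    }
}
```

In a Flink job, one option would be to key the stream by the producer ID and 
keep the watermark in keyed state (for example a ValueState<Long> inside a 
RichFlatMapFunction), so that it is checkpointed together with the Kafka 
offsets. Whether that fits depends on how your producers assign sequence 
numbers.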

Cheers,
Gordon

On January 14, 2017 at 6:12:29 PM, ljwagerfield (lawre...@dmz.wagerfield.com) 
wrote:

As I understand it, the Flink Kafka Producer may emit duplicates to Kafka  
topics.  

How can I deduplicate these messages when reading them back with Flink (via  
the Flink Kafka Consumer)?  

For example, is there any out-of-the-box support for deduplicating messages,  
e.g. by incorporating something like "idempotent producers" as proposed by  
Jay Kreps (which, as I understand it, involves maintaining a "high  
watermark" on a message-by-message level)?  



--  
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Deduplicate-messages-from-Kafka-topic-tp11051.html
  
Sent from the Apache Flink User Mailing List archive at Nabble.com.


Deduplicate messages from Kafka topic

2017-01-14 Thread ljwagerfield
As I understand it, the Flink Kafka Producer may emit duplicates to Kafka
topics.

How can I deduplicate these messages when reading them back with Flink (via
the Flink Kafka Consumer)?

For example, is there any out-of-the-box support for deduplicating messages,
e.g. by incorporating something like "idempotent producers" as proposed by
Jay Kreps (which, as I understand it, involves maintaining a "high
watermark" on a message-by-message level)?


