Thank you for the reply. Yes, Kylin would not know the semantics of a duplicate when using the Kafka consumer API; it is left to custom application code to handle that.
The question actually means: "Is there any best practice for implementing de-duplication in custom application code with a Kylin streaming cube?"

For example, a naive solution would be: assign each "row" a UUID and ensure each row goes to a fixed topic partition. Assume that no Kafka retry will happen more than 10 seconds after the original send. Then, in Kylin's "read Kafka message to HDFS" step, add application logic that remembers the UUIDs seen in the past 10s for each topic partition and discards any message whose UUID has already been seen. But this naive solution needs to modify KafkaInputRecordReader (if using the MR engine) and costs some memory.

Are there any suggested ways or best practices to do this? Thanks.

________________________________
From: Billy Liu <[email protected]>
Sent: Tuesday, May 16, 2017 3:07 PM
To: user
Subject: Re: Streaming cube - workaround to duplicate messages by kafka producer retry?

Kafka provides the ack mechanism, although an all-acks solution would hurt throughput and performance. Users can configure it via Kafka client parameters. Kylin would not know, and should not know, how to process duplicate messages; a duplicate is a semantic concept. What Kylin can guarantee is to not consume the messages more than once.

2017-05-16 22:37 GMT+08:00 Tingmao Lin <[email protected]>:

Hi,

The current version of the Kafka producer provides at-least-once semantics, so duplicates may occur in the stream due to producer retries. (The idempotent producer is still under development: https://issues.apache.org/jira/browse/KAFKA-4815, which tracks implementation progress for KIP-98: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging.)

When using a streaming cube, Kylin may therefore get duplicated messages and produce unexpected results. Does anyone have experience dealing with this problem?
I think this is more about Kafka itself, but since no idempotent producer is available at the current time, could I have some advice on working around it on the Kylin side? Thanks.
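For what it's worth, the time-windowed, per-partition de-duplication idea described above could be sketched roughly as follows. This is a hypothetical standalone class, not part of Kylin or Kafka; in practice it would have to be wired into (a modified) KafkaInputRecordReader, and the 10-second window is only the assumption from the thread, not a Kafka guarantee.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch (hypothetical, not a Kylin API): remembers the UUIDs seen
 * in the last windowMillis per topic partition and rejects any message
 * whose UUID was already seen within that window.
 */
public class WindowedDeduplicator {

    private static final class Entry {
        final String uuid;
        final long timestamp;
        Entry(String uuid, long timestamp) {
            this.uuid = uuid;
            this.timestamp = timestamp;
        }
    }

    private final long windowMillis;
    // Per-partition state: a set for O(1) duplicate lookups plus a queue
    // that keeps arrival order so old UUIDs can be expired cheaply.
    private final Map<Integer, Set<String>> seenByPartition = new HashMap<>();
    private final Map<Integer, Deque<Entry>> orderByPartition = new HashMap<>();

    public WindowedDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the message is new and should be kept, false if it is a duplicate. */
    public boolean accept(int partition, String uuid, long timestampMillis) {
        Set<String> seen = seenByPartition.computeIfAbsent(partition, p -> new HashSet<>());
        Deque<Entry> order = orderByPartition.computeIfAbsent(partition, p -> new ArrayDeque<>());
        // Expire UUIDs older than the retry window; per the assumption above,
        // Kafka will not retry a message after that much time has passed.
        while (!order.isEmpty() && timestampMillis - order.peekFirst().timestamp > windowMillis) {
            seen.remove(order.pollFirst().uuid);
        }
        if (!seen.add(uuid)) {
            return false; // duplicate within the window -> discard
        }
        order.addLast(new Entry(uuid, timestampMillis));
        return true;
    }
}
```

Because duplicates are rejected before being recorded, each UUID has exactly one queue entry, so the set and the queue stay in sync. Memory cost is bounded by the per-partition message rate times the window length, which matches the "costs some memory" caveat above.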
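On Billy's point about the ack mechanism: that is producer-side configuration. A minimal sketch of the relevant knobs, using standard Kafka producer configuration keys (the values and the broker address are only illustrative):

```java
import java.util.Properties;

public class ProducerRetryConfig {
    // Standard Kafka producer configuration keys; values are illustrative.
    // "acks=all" gives the strongest delivery guarantee at a throughput
    // cost, and "retries" > 0 is exactly what can introduce the duplicates
    // discussed in this thread.
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all");
        props.put("retries", "3");
        props.put("retry.backoff.ms", "100");
        return props;
    }
}
```

These properties would be passed to the KafkaProducer constructor; they tune durability versus throughput but, as Billy notes, they do not remove duplicates by themselves.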
