Suppose I have a producer that ingests a data stream with unique keys from an external service and sends it to a Kafka topic. In the producer I can set enable.idempotence=true and get exactly-once delivery in the presence of broker failures and retries. However, the producer process itself might crash after it delivers a batch of messages to Kafka but before it records that the batch was delivered. After a restart, the producer would re-deliver the same batch, resulting in duplicate messages in the topic.
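For concreteness, this is roughly my producer setup (string keys/values and the topic name are just illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Guards against duplicates from broker-side retries, but not against
        // the producer process itself crashing and re-sending a batch.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "unique-key-123", "payload"));
        }
    }
}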
With a Kafka Streams transformer I can deduplicate the topic by using a state store to record previously seen keys and only forwarding a record if its key hasn't been seen before (rough sketch below). However, without a mechanism to remove old keys, the state store will grow without bound. Say I only want to deduplicate over a time period such as one day (I'm confident I'll be able to restart a crashed producer sooner than that). I'd like keys older than a day to expire out of the state store, so it only has to track keys seen in the last day or so. Is there a way to do this with Kafka Streams? Or is there another recommended mechanism for keeping a topic of unique-keyed messages free of duplicates in the presence of producer crashes?
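Here's the kind of transformer I have in mind, using a windowed store in the hope that entries older than the retention period get dropped by the store itself. I'm writing against the classic Transformer API (I know newer Streams releases are moving to the Processor API); the store name, topic names, and serdes are placeholders:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class DeduplicationSketch {
    static final String STORE = "dedup-store";
    static final Duration RETENTION = Duration.ofDays(1);

    // Forwards a record only if its key hasn't been seen in the last day.
    static class Dedup implements Transformer<String, String, KeyValue<String, String>> {
        private ProcessorContext context;
        private WindowStore<String, Long> seen;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.seen = (WindowStore<String, Long>) context.getStateStore(STORE);
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            long now = context.timestamp();
            // Any hit for this key within the last day means it's a duplicate.
            try (WindowStoreIterator<Long> hits =
                     seen.fetch(key, now - RETENTION.toMillis(), now)) {
                if (hits.hasNext()) {
                    return null; // drop the duplicate
                }
            }
            seen.put(key, now, now);
            return KeyValue.pair(key, value);
        }

        @Override
        public void close() { }
    }

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // A windowed store drops segments older than its retention period,
        // so it should only hold roughly the last day's worth of keys.
        builder.addStateStore(Stores.windowStoreBuilder(
                Stores.persistentWindowStore(STORE, RETENTION, RETENTION, false),
                Serdes.String(), Serdes.Long()));
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .transform(Dedup::new, STORE)
               .to("deduped-topic", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }
}

Is a windowed store with retention the right way to get this expiry behavior, or is there a better-suited mechanism?

Thanks! Andrew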