Hello Everyone,

I am quite excited about the recent example of replicating PostgreSQL changes to Kafka. My view on the log compaction feature had always been a rather sceptical one, but now that its great potential has been exposed to the wider public, I think it's an awesome feature. Especially when pulling this data into HDFS as a snapshot, it is (IMO) a Sqoop killer. So I want to thank everyone who had the vision to build these kinds of systems at a time when I could not yet imagine them.

There is one open question that I would like people to help me with. When pulling a snapshot of a partition into HDFS using a Camus-like application, I feel the need to keep a set of all keys read so far and to stop as soon as I find a key that is already in my set. I use this as an indicator of how far log compaction has already progressed and only pull up to this point. This works quite well, as I do not need to keep the messages in memory, only their keys. A minimal sketch of the idea follows below.
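For illustration, here is a rough sketch of that approach using the current KafkaConsumer API (the original would have used the older consumer, and Camus its own readers). The broker address, topic name, and the writeToSnapshot helper are hypothetical placeholders, not anything from the original post:

import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CompactedSnapshotReader {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("my-compacted-topic", 0); // hypothetical topic

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));

            Set<String> seenKeys = new HashSet<>(); // only the keys are kept in memory
            boolean duplicateFound = false;

            while (!duplicateFound) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    break; // for simplicity, treat an empty poll as the end of the partition
                }
                for (ConsumerRecord<String, String> record : records) {
                    // A repeated key means we have left the fully compacted
                    // prefix of the log: stop the snapshot here.
                    if (!seenKeys.add(record.key())) {
                        duplicateFound = true;
                        break;
                    }
                    writeToSnapshot(record); // e.g. append to an HDFS file
                }
            }
        }
    }

    // Placeholder for the actual HDFS write in a Camus-like job.
    private static void writeToSnapshot(ConsumerRecord<String, String> record) {
        System.out.printf("%s -> %s%n", record.key(), record.value());
    }
}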

The question I want to raise with the community is:

How do you prevent pulling the same record twice (in different versions), and would it be beneficial if the "OffsetResponse" also returned the last offset that has been compacted so far, so that the application could just pull up to this point?

Looking forward to your recommendations and comments.

Best
Jan
