Hello Everyone,
I am quite excited about the recent example of replicating PostgreSQL
changes to Kafka. I had always been rather sceptical of the log
compaction feature, but now that its great potential has been exposed
to a wider public, I think it's an awesome feature. Especially when
pulling this data into HDFS as a snapshot, it is (IMO) a Sqoop killer.
So I want to thank everyone who had the vision to build this kind of
system at a time when I could not imagine it.
There is one open question that I would like people to help me with.
When pulling a snapshot of a partition into HDFS using a Camus-like
application, I feel the need to keep a set of all keys read so far and
stop as soon as I find a key that is already in my set. I use this as
an indicator of how far log compaction has already progressed and only
pull up to that point. This works quite well, as I only need to keep
the keys in memory, not the messages themselves.
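To make the idea concrete, here is a minimal sketch of that key-set
approach (the function name and the in-memory record list are my own
illustration; in a real consumer the records would arrive in offset
order from a Kafka partition):

```python
def snapshot_until_duplicate(records):
    """Consume (key, value) records in offset order, stopping at the
    first repeated key.

    A repeated key suggests we have reached the part of the log that
    compaction has not yet cleaned, so everything collected before it
    is taken as the snapshot.
    """
    seen_keys = set()  # only keys are held in memory, not messages
    snapshot = []
    for key, value in records:
        if key in seen_keys:
            break  # first duplicate key: assume compaction boundary
        seen_keys.add(key)
        snapshot.append((key, value))
    return snapshot

# Hypothetical log: "a" and "b" survive compaction at the head, then
# "a" reappears in the uncompacted tail, so the pull stops there.
log = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5)]
print(snapshot_until_duplicate(log))
```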
The question I want to raise with the community is: how do you prevent
pulling the same record twice (in different versions)? And would it be
beneficial if the "OffsetResponse" also returned the last offset
compacted so far, so that the application could simply pull up to that
point?
Looking forward to your recommendations and comments.
Best
Jan