Hello Everyone,
I am quite excited about the recent example of replicating PostgreSQL
changes to Kafka. I had always been rather sceptical of the log
compaction feature, but now that its great potential has been exposed
to a wider public, I think it's an awesome feature. Especially when
pulling this data into HDFS as a snapshot, it is (IMO) a Sqoop killer.
So I want to thank everyone who had the vision to build this kind of
system at a time when I could not imagine it.
There is one open question that I would like people to help me with.
When pulling a snapshot of a partition into HDFS using a Camus-like
application, I feel the need to keep a set of all keys read so far and
stop as soon as I find a key that is already in my set. I use this as
an indicator of how far log compaction has already progressed and only
pull up to that point. This works quite well, as I only need to keep
the keys in memory, not the messages themselves.
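To make the idea concrete, here is a minimal sketch of that key-set
approach (the function name and the in-memory record list are my own
illustration; in a real consumer the records would arrive in offset
order from a Kafka partition):

```python
def snapshot_until_duplicate(records):
    """Consume (key, value) records in offset order, stopping at the
    first repeated key.

    A repeated key suggests we have reached the part of the log that
    compaction has not yet cleaned, so everything collected before it
    is taken as the snapshot.
    """
    seen_keys = set()  # only keys are held in memory, not messages
    snapshot = []
    for key, value in records:
        if key in seen_keys:
            break  # first duplicate key: assume compaction boundary
        seen_keys.add(key)
        snapshot.append((key, value))
    return snapshot

# Hypothetical log: "a" and "b" survive compaction at the head, then
# "a" reappears in the uncompacted tail, so the pull stops there.
log = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5)]
print(snapshot_until_duplicate(log))
```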
The question I want to raise with the community is: how do you prevent
pulling the same record twice (in different versions)? And would it be
beneficial if the "OffsetResponse" also returned the last offset
compacted so far, so that the application could simply pull up to that
point?
Looking forward to your recommendations and comments.
Best
Jan