Re: Data loss scenarios

Yury Ruchin Wed, 20 Jan 2016 22:26:04 -0800

Disclaimer: the below only applies to KafkaSpout class. It's not about
TridentKafkaSpout which should go to greater length ensuring exactly-once
semantics - I'm not yet familiar with its implementation details.


There is nothing special about Kafka cluster in this case. As for
KafkaSpout, it will remember tuples it fetched in an in-memory Map. After
ack() is received for a tuple, it is removed from the remembed pending
tuples. Every "stateUpdateIntervalMs" it commits to ZK the offset that it
has received acks up to.

Now, if spout crashes, it does not store offsets to ZK and everything
that's pending (including tuples on wire) will be replayed when the spout
is reinstated later. It will re-read latest saved offsets from ZK, re-fetch
and re-emit lost messages. This leads to at-least-once semantics, since
some messages that have not been acked might have been actually processed
by downstream topology, with only ack() notification lost because of the
crash.

If KafkaSpout receives fail() from downstream topology, it will keep
replaying the failed tuple until it falls behind the latest offset more
than "maxOffsetBehind" messages.

Re: Data loss scenarios

Reply via email to