Hello all,
I've spent the last few months testing Bufstream, a Kafka-compatible
system. In the course of that research, we discovered that the Kafka
transaction protocol allows aborted reads, lost writes, and torn
transactions:
https://jepsen.io/analyses/bufstream-0.1.0
In short, the protocol assumes that message delivery is ordered, but
sends messages over different TCP connections, to different nodes, with
automatic retries. When network or node hiccups (e.g. garbage
collection) delay delivery of a commit or abort message, that message
can commit or abort a different, later transaction. Committed
transactions can actually be lost. Aborted transactions can actually
succeed. Transactions can be torn into parts: some of their effects
committed, others lost.
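To make the failure mode concrete, here is a minimal sketch of the race. It is a hypothetical model, not actual Kafka or Bufstream code: a toy broker matches commit/abort messages to whatever transaction is currently open for a producer, with no fencing, so a delayed commit from an earlier, aborted transaction lands on a later one.

```python
# Toy model (hypothetical, not real Kafka code): EndTxn messages are matched
# by producer id alone, so a delayed commit can apply to a later transaction.
from collections import deque

class Broker:
    def __init__(self):
        self.pending = {}     # producer_id -> buffered writes of the open txn
        self.committed = []   # writes made visible to consumers

    def handle(self, msg):
        kind, pid, *rest = msg
        if kind == "begin":
            self.pending[pid] = []
        elif kind == "write":
            self.pending.setdefault(pid, []).append(rest[0])
        elif kind == "commit":
            # No fencing: whatever transaction is open for this producer
            # gets committed, even if it is a later one.
            self.committed += self.pending.pop(pid, [])
        elif kind == "abort":
            self.pending.pop(pid, None)

broker = Broker()
delayed = deque()

# Transaction 1: the client writes "a" and sends a commit, which is delayed.
broker.handle(("begin", 1))
broker.handle(("write", 1, "a"))
delayed.append(("commit", 1))      # in flight, not yet delivered

# The client times out waiting for the commit ack and aborts transaction 1.
broker.handle(("abort", 1))        # txn 1's write "a" is discarded

# Transaction 2 begins; the stale commit arrives mid-transaction.
broker.handle(("begin", 1))
broker.handle(("write", 1, "b"))
broker.handle(delayed.popleft())   # stale commit lands on txn 2
broker.handle(("write", 1, "c"))   # "c" joins a fresh txn that never commits

print(broker.committed)  # -> ['b']
```

The result shows all three anomalies at once: transaction 1 ("a") was acknowledged as aborted yet its commit took effect elsewhere, and transaction 2 is torn, with "b" committed but "c" lost.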
We've reproduced these problems in both Bufstream and Kafka itself, and
we believe every Kafka-compatible system is likely susceptible.
KIP-890 may help. Client maintainers may also be able to defend against
this problem by re-initializing producers on indefinite errors, like RPC
timeouts.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235834631#KIP890:TransactionsServerSideDefense-BumpEpochonEachTransactionforNewClients(1)
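The fencing idea behind both defenses can be sketched by extending the toy model with an epoch. This is again a hypothetical model: re-initializing the producer (in a real Java client, closing it and calling initTransactions() again) obtains a fresh epoch from the coordinator, so a commit delayed from before the re-initialization no longer matches and is rejected.

```python
# Hypothetical model of epoch fencing: every message carries the epoch it
# was sent under, and the broker drops messages from a stale epoch.
class FencingBroker:
    def __init__(self):
        self.epoch = {}       # producer_id -> current epoch
        self.pending = {}     # producer_id -> buffered writes
        self.committed = []

    def init_producer(self, pid):
        # Re-initialization bumps the epoch, fencing older in-flight messages.
        self.epoch[pid] = self.epoch.get(pid, 0) + 1
        self.pending[pid] = []
        return self.epoch[pid]

    def handle(self, msg):
        kind, pid, epoch, *rest = msg
        if epoch != self.epoch.get(pid):
            return "fenced"            # stale message: drop it
        if kind == "write":
            self.pending[pid].append(rest[0])
        elif kind == "commit":
            self.committed += self.pending[pid]
            self.pending[pid] = []
        elif kind == "abort":
            self.pending[pid] = []

broker = FencingBroker()
e1 = broker.init_producer(1)
broker.handle(("write", 1, e1, "a"))
stale_commit = ("commit", 1, e1)       # delayed in the network

# The commit times out; the client re-initializes and gets a fresh epoch.
e2 = broker.init_producer(1)
broker.handle(("write", 1, e2, "b"))
print(broker.handle(stale_commit))     # -> fenced
broker.handle(("commit", 1, e2))
print(broker.committed)                # -> ['b']
```

The stale commit is rejected instead of silently committing the wrong transaction; KIP-890's per-transaction epoch bump applies the same principle on every transaction rather than only on re-initialization.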
Yours truly,
--Kyle