Hello all,

I've spent the last few months testing Bufstream, a Kafka-compatible system. In the course of that research, we discovered that the Kafka transaction protocol allows aborted reads, lost writes, and torn transactions:

https://jepsen.io/analyses/bufstream-0.1.0

In short, the protocol assumes that message delivery is ordered, but sends messages over different TCP connections, to different nodes, with automatic retries. When network or node hiccups (e.g. garbage collection) delay delivery of a commit or abort message, that message can commit or abort a different, later transaction. Committed transactions can actually be lost. Aborted transactions can actually succeed. Transactions can be torn into parts: some of their effects committed, others lost.

We've reproduced these problems in both Bufstream and Kafka itself, and we believe every Kafka-compatible system is most likely susceptible. KIP-890 may help. Client maintainers may also be able to defend against this problem by re-initializing producers on indefinite errors, like RPC timeouts.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235834631#KIP890:TransactionsServerSideDefense-BumpEpochonEachTransactionforNewClients(1).

Yours truly,

--Kyle

Reply via email to