I am planning to rewrite a message queue, drawing on the architectural designs 
of Kafka and Pulsar. The main architecture is as follows: 


The entire system is built around read-write separation. Message writing and 
replication closely follow Kafka, with leader and follower replicas handling 
replication. However, the system will have only a limited number of 
partitions, say 10 or 20. All queue messages are written into these few 
partitions, with several brokers sharing the write and read load. Writes are 
concentrated on a single disk, and each write is also appended to a write 
cache so consumers can fetch it quickly. Reads first try the read cache and, 
on a miss, fall back to the split partitions on /disk2. Each broker uses two 
disks: /disk1 is dedicated to writing and replication, and /disk2 serves as 
secondary storage.
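
To make the read path concrete, here is a minimal Rust sketch of the 
cache-then-disk lookup. Every name in it (Message, ReadCache, read_message, 
the /disk2/queues/<id>/segment.log layout) is a hypothetical stand-in, and the 
fixed-length read is a placeholder for a real per-segment sparse-index lookup:

    use std::collections::HashMap;
    use std::fs::File;
    use std::io::{self, Read, Seek, SeekFrom};

    /// One decoded message; the real record layout is still undecided.
    #[derive(Clone)]
    pub struct Message {
        pub offset: u64,
        pub payload: Vec<u8>,
    }

    /// In-memory read cache, keyed by (logical queue id, offset).
    pub struct ReadCache {
        entries: HashMap<(u32, u64), Message>,
    }

    impl ReadCache {
        pub fn get(&self, queue: u32, offset: u64) -> Option<Message> {
            self.entries.get(&(queue, offset)).cloned()
        }
    }

    /// Read path: hit the read cache first; on a miss, fall back to the
    /// per-queue physical file that the background splitter wrote on /disk2.
    pub fn read_message(
        cache: &ReadCache,
        queue: u32,
        offset: u64,
        len: usize,
    ) -> io::Result<Message> {
        if let Some(msg) = cache.get(queue, offset) {
            return Ok(msg); // served from memory, no disk I/O
        }
        // Cache miss: read from the split physical queue on /disk2.
        let mut file = File::open(format!("/disk2/queues/{queue}/segment.log"))?;
        file.seek(SeekFrom::Start(offset))?;
        let mut payload = vec![0u8; len];
        file.read_exact(&mut payload)?;
        Ok(Message { offset, payload })
    }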
On /disk2, the messages that were written in bulk to the shared partitions are 
split out and appended to their respective per-queue physical files for 
long-term storage. The splitting runs in the background, with checkpoints 
taken periodically. Once messages have been split, they are pruned from the 
write queue, which rolls over based on time or size. The write queue is 
therefore short-term storage, existing only for writing and leader-follower 
replication, while the read queues serve as long-term storage.
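
Here is a rough sketch of one splitter pass under the same hypothetical layout 
(fsync, batching, and error recovery are deliberately omitted): it 
demultiplexes framed records from the shared write queue into per-queue files 
on /disk2, then records a checkpoint so the write queue can prune everything 
up to that offset when it rolls over.

    use std::fs::{self, OpenOptions};
    use std::io::{self, Write};
    use std::path::Path;

    /// A record as framed in the shared write queue on /disk1; the logical
    /// queue id travels with the payload so the splitter can demultiplex.
    pub struct FramedRecord {
        pub queue_id: u32,
        pub payload: Vec<u8>,
    }

    /// One background splitter pass over a batch read from the write queue.
    pub fn split_batch(batch: &[FramedRecord], consumed_up_to: u64) -> io::Result<()> {
        for rec in batch {
            let dir = format!("/disk2/queues/{}", rec.queue_id);
            fs::create_dir_all(&dir)?;
            let mut seg = OpenOptions::new()
                .create(true)
                .append(true)
                .open(Path::new(&dir).join("segment.log"))?;
            seg.write_all(&rec.payload)?;
        }
        // Periodic checkpoint (here, after every batch): anything at or
        // below this write-queue offset is now safe to prune on rollover.
        fs::write("/disk2/checkpoint", consumed_up_to.to_le_bytes())?;
        Ok(())
    }

One thing this sketch glosses over: a crash between the appends and the 
checkpoint would re-split the same records on restart, so the real splitter 
would need per-queue offsets or some other way to make the appends idempotent.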

The system draws on Kafka for writing and replication, and on Pulsar for its 
read-write separation architecture. From years of running Kafka, I have seen 
that cluster throughput drops off quickly once the partition count grows 
large. I therefore merge all logical queues together and write them in large 
blocks to a limited number of partitions, reducing the partition count in the 
cluster to improve write and replication throughput. At the same time, I 
enlarge the read and write caches to support fast message reads by consumers, 
with secondary storage providing long-term message persistence. The secondary 
storage sits off the write path: it is split and written to disk according to 
the actual partitions, i.e. the individual logical queues.
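
The merging itself only requires a small frame around each record so that 
messages from many logical queues can share one physical partition and still 
be split apart later. Below is a minimal sketch of one possible framing; the 
header layout (queue id plus length prefix) is my assumption, and a production 
format would more likely resemble Kafka's record batches with CRCs and 
timestamps:

    /// Append one framed record to a buffer bound for the shared partition
    /// (and, on the write path, for the write cache as well).
    pub fn encode(queue_id: u32, payload: &[u8], out: &mut Vec<u8>) {
        out.extend_from_slice(&queue_id.to_le_bytes());
        out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        out.extend_from_slice(payload);
    }

    /// Decode the next framed record from a buffer, returning the logical
    /// queue id, the payload, and the number of bytes consumed.
    pub fn decode(buf: &[u8]) -> Option<(u32, &[u8], usize)> {
        if buf.len() < 8 {
            return None; // incomplete header
        }
        let queue_id = u32::from_le_bytes(buf[0..4].try_into().ok()?);
        let len = u32::from_le_bytes(buf[4..8].try_into().ok()?) as usize;
        let end = 8usize.checked_add(len)?;
        if buf.len() < end {
            return None; // incomplete payload
        }
        Some((queue_id, &buf[8..end], end))
    }

With a frame like this, the producer-facing write path stays a pure sequential 
append to one of the few shared partitions, and only the background splitter 
ever needs to look inside the frames.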

To avoid memory leaks and the headaches of Java GC, the entire system will be 
written in Rust, lowering the application's own footprint and freeing up a 
significant amount of memory for use as read and write caches. The project is 
still at the design stage, and I would sincerely appreciate any suggestions.

