On Mon, Jun 19, 2023 at 07:20:34PM +0200, kvaps wrote:
> Hi Kevin and the community,
[ CC Alberto, Alice, Stefan ]

> I am designing a CSI driver for Kubernetes that allows efficient
> utilization of SAN (Storage Area Network) and supports thin
> provisioning, snapshots, and ReadWriteMany mode for block devices.

Hi Andrei,

Good to hear that. Alberto has also been working on a CSI driver which
makes use of qemu-storage-daemon and qcow2 files, either with local
storage or shared storage like NFS. At this point it focuses on
filesystem backends, as that's where it is easiest to manage qcow2
files. But I think it could be extended to support block device
backends (e.g. LVM) too.

https://gitlab.com/subprovisioner/subprovisioner

This is still work in progress, but I think there might be some
overlap between your work and the subprovisioner project.

> To implement this, I have explored several technologies such as
> traditional LVM, LVMThin (which does not support shared mode), and
> QCOW2 on top of block devices. This is the same approach that oVirt
> uses for thin provisioning over a shared LUN:
>
> https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
>
> Based on benchmark results, I found that the performance degradation
> of block-backed QCOW2 is much lower compared to LVM and LVMThin while
> creating snapshots:
>
> https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
>
> Therefore, I have decided to use the same approach for Kubernetes.

Hmm, I will need to spend more time going through the numbers and the
setup. This result is a little surprising to me, though. If you are
using vduse, nbd, or ublk kinds of exports, all I/O goes to the kernel
first, then to userspace (qsd), and then back into the kernel. But
with a pure LVM-based approach, the I/O path is much shorter (user
space to kernel). Given that, it's a little surprising that qcow2 is
still faster compared to LVM.
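For context, the vduse path being discussed would be set up roughly
like this. This is only a sketch under my assumptions (device path,
node names, and queue settings are made up; qemu-storage-daemon needs
to be built with VDUSE export support, QEMU 7.1+), not the author's
actual invocation:

```shell
# Sketch: expose a qcow2 image on an LV as a VDUSE block device.
# /dev/vg0/lv0, node names, and "vduse0" are example values.
qemu-storage-daemon \
  --blockdev driver=file,filename=/dev/vg0/lv0,node-name=file0 \
  --blockdev driver=qcow2,file=file0,node-name=disk0 \
  --export type=vduse-blk,id=vduse0,node-name=disk0,name=vduse0,num-queues=1,writable=on

# Bind the VDUSE device to the virtio-vdpa bus so a /dev/vd* block
# device appears on the host (iproute2 vdpa tool):
vdpa dev add name vduse0 mgmtdev vduse
```

In this path every request traverses kernel (virtio-blk) -> userspace
(qsd) -> kernel (LVM/SAN), which is the double transition mentioned
above.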
If you somehow managed to use a vhost-user-blk export instead, then
the I/O path is shorter for qcow2 as well, and that might perform
well.

> But in Kubernetes, the storage system needs to be self-sufficient and
> not dependent on the workload that uses it. Thus, unlike oVirt, we
> have no option to use the libvirt interface of the running VM to
> invoke live migration. Instead, we should provide a pure block device
> in ReadWriteMany mode, where the block device can be writable on
> multiple hosts simultaneously.
>
> To achieve this, I decided to use the qemu-storage-daemon with the
> VDUSE backend.
>
> Other technologies, such as NBD and UBLK, were also considered, and
> their benchmark results can be seen in the same document on a
> different sheet:
>
> https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
>
> Taking into account the performance, stability, and versatility, I
> concluded that VDUSE is the optimal choice. To connect the device in
> Kubernetes, the virtio-vdpa interface would be used, and the entire
> scheme could look like this:

NBD will be slow. I am curious how UBLK and VDUSE block exports
compare. Technically there does not seem to be any reason why a VDUSE
virtio-vdpa device would be faster than ublk, but I could be wrong.

What about the vhost-user-blk export? Have you considered that? That
probably will be the fastest.
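A vhost-user-blk export of the same qcow2 node would look roughly like
this. Again a sketch, with made-up paths and names; note that it
serves a vhost-user frontend (e.g. a VM or libblkio client) over a
Unix socket rather than producing a host block device directly:

```shell
# Sketch: export a qcow2 image over vhost-user-blk instead of VDUSE.
# Filename, node names, and the socket path are example values.
qemu-storage-daemon \
  --blockdev driver=file,filename=/images/disk.qcow2,node-name=file0 \
  --blockdev driver=qcow2,file=file0,node-name=disk0 \
  --export type=vhost-user-blk,id=vub0,node-name=disk0,addr.type=unix,addr.path=/tmp/vhost-user-blk.sock,writable=on
```

The shorter path comes from the frontend mapping guest/client memory
directly, so requests do not bounce through the kernel block layer on
the way to qsd.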
> +---------------------+      +---------------------+
> | node1               |      | node2               |
> |                     |      |                     |
> |    +-----------+    |      |    +-----------+    |
> |    | /dev/vda  |    |      |    | /dev/vda  |    |
> |    +-----+-----+    |      |    +-----+-----+    |
> |          |          |      |          |          |
> |     virtio-vdpa     |      |     virtio-vdpa     |
> |          |          |      |          |          |
> |        vduse        |      |        vduse        |
> |          |          |      |          |          |
> | qemu-storage-daemon |      | qemu-storage-daemon |
> |          |          |      |          |          |
> | +------- | -------+ |      | +------- | -------+ |
> | | LUN    |        | |      | | LUN    |        | |
> | |  +-----+-----+  | |      | |  +-----+-----+  | |
> | |  | LV (qcow2)|  | |      | |  | LV (qcow2)|  | |
> | |  +-----------+  | |      | |  +-----------+  | |
> | +--------+--------+ |      | +--------+--------+ |
> |          |          |      |          |          |
> +--------- | ---------+      +--------- | ---------+
>            |                            |
>            |          +-----+           |
>            +----------| SAN |-----------+
>                       +-----+
>
> Despite two independent instances of qemu-storage-daemon for the same
> qcow2 disk running successfully on different hosts, I have concerns
> about their proper functioning. Similar to live migration, I think
> they should share the state between each other.

Is it the same LV on both nodes? How are you activating the same LV on
two nodes? IIUC, LVM does not allow that.

> The question is how to make qemu-storage-daemon share the state
> between multiple nodes, or is the qcow2 format inherently stateless
> and does not require this?

That's a good question. For simplicity, we could think of NFS-backed
storage with a qcow2 file providing the storage. Can two QSD instances
actually work with the same qcow2 file? I am not sure this can be made
to work with writable storage. Read-only storage, probably yes.

For example, even if QSD could handle that, we will still have some
local filesystem visible to the client on top of this block device
(say ext4/xfs/btrfs). These are built for a single user and don't
expect any other client to be changing blocks at the same time.

So I am not sure how one can export ReadWriteMany volumes using qcow2,
or LVM for that matter. We probably need a shared filesystem for that
(NFS, GFS, etc.). Am I missing something?
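For reference, the only mechanism I know of for activating one LV on
several hosts at once is a shared VG managed by lvmlockd. A rough
sketch, assuming lvmlockd is running with sanlock or dlm and the
VG/LV/device names are placeholders:

```shell
# Sketch: shared-activation of an LV across hosts via lvmlockd.
# Requires use_lvmlockd=1 in lvm.conf and a lock manager (sanlock/dlm).
vgcreate --shared vg0 /dev/mapper/san-lun   # create the VG in shared mode
vgchange --lockstart vg0                    # start the VG lockspace on each host
lvchange -asy vg0/lv0                       # activate the LV in shared ("-asy") mode
```

Even with that, shared activation only serializes LVM metadata
operations; it does nothing to coordinate concurrent writes to the
qcow2 metadata inside the LV, which is the real problem here.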
Thanks
Vivek