On Mon, Feb 13, 2017 at 11:04:30AM +0100, Kevin Wolf wrote:
> Am 12.02.2017 um 01:58 hat Nir Soffer geschrieben:
> > On Sat, Feb 11, 2017 at 12:23 AM, Nir Soffer <nir...@gmail.com> wrote:
> > > Hi all,
> > >
> > > I'm trying to convert images (mostly qcow2) to raw format on a thin lv,
> > > hoping to write only the allocated blocks on the thin lv, but
> > > it seems that qemu-img cannot write a sparse image to a block
> > > device.
> > >
> > > (...)
> >
> > So it seems that qemu-img is trying to write a sparse image.
> >
> > I tested again with an empty file:
> >
> >     truncate -s 20m empty
> >
> > Using strace, qemu-img checks the device's discard_zeroes_data:
> >
> >     ioctl(11, BLKDISCARDZEROES, 0) = 0
> >
> > Then it finds that the source is empty:
> >
> >     lseek(10, 0, SEEK_DATA) = -1 ENXIO (No such device or address)
> >
> > Then it issues one call:
> >
> >     [pid 10041] ioctl(11, BLKZEROOUT, 0x7f6049c82ba0) = 0
> >
> > And fsyncs and closes the destination.
> >
> >     # grep -s "" /sys/block/dm-57/queue/discard_*
> >     /sys/block/dm-57/queue/discard_granularity:65536
> >     /sys/block/dm-57/queue/discard_max_bytes:17179869184
> >     /sys/block/dm-57/queue/discard_zeroes_data:0
> >
> > I wonder why discard_zeroes_data is 0, while discarding
> > blocks seems to zero them.
> >
> > Seems that this is the bug:
> > https://bugzilla.redhat.com/835622
> >
> > A thin lv does promise (by default) to zero newly allocated blocks,
> > and it does return zeros when reading unallocated data, like
> > a sparse file.
> >
> > Since qemu does not know that the thin lv is not allocated, it cannot
> > skip empty blocks safely.
> >
> > It would be useful if it had a flag to force sparseness when the
> > user knows that this operation is safe, or maybe we need a thin lvm
> > driver?
>
> Yes, I think your analysis is correct; I seem to remember that I've seen
> this happen before.
> The Right Thing (TM) to do, however, seems to be fixing the kernel so
> that BLKDISCARDZEROES correctly returns that discard does in fact zero
> out blocks on this device. As soon as this ioctl works correctly,
> qemu-img should just automatically do what you want.
>
> Now if it turns out it is important to support older kernels without the
> fix, we can think about a driver-specific option for the 'file' driver
> that overrides the kernel's value. But I really want to make sure that
> we use such workarounds only in addition to, not instead of, doing the
> proper root cause fix in the kernel.
>
> So can you please bring it up with the LVM people?
I'm not sure it's that easy. The discard granularity of LVM thin volumes
is not equal to their reported block/sector sizes, but to the size of the
chunks they allocate:

    # blockdev --getss /dev/dm-9
    512
    # blockdev --getbsz /dev/dm-9
    4096
    # blockdev --getpbsz /dev/dm-9
    4096
    # cat /sys/block/dm-9/queue/discard_granularity
    131072
    #

I currently don't see qemu using the discard_granularity property for this
purpose. IIRC the code for write_zeroes(), e.g., simply checks the
discard_zeroes flag but not what size it is trying to zero out/discard.

We have an experimental, semi-complete, "can-do-footshooting" 'zeroinit'
filter for this purpose: it explicitly sets the "has_zero_init" flag and
drops write_zeroes() calls for blocks at an address greater than the
highest one written so far. It should use a dirty bitmap instead, and is
somewhat dangerous this way, which is why it's not on the qemu-devel list.
But if this approach is at all acceptable (despite being a hack), I could
improve it and send it to the list?

https://github.com/Blub/qemu/commit/6f6f38d2ef8f22a12f72e4d60f8a1fa978ac569a

(You'd just prefix the destination with `zeroinit:` in the qemu-img
command.)

Additionally, I'm currently still playing with the details and quirks of
various storages (lvm/dm thin, rbd, zvols) in an attempt to create a tool
to convert between them. (I did some successful tests converting disk
images between these storages and qcow2, together with their snapshots, in
a COW-aware way.) I'm planning on releasing some experimental code
soon-ish, though there's still some polishing to do on the documentation,
the library's API and the format - and the qcow2 support is a patch for
qemu-img to use the library.

My adventures into dm-thin metadata allow me to answer this one, though:

> > or maybe we need a thin lvm driver?

Probably not. It does not support SEEK_DATA/SEEK_HOLE and to my knowledge
also has no other sane metadata querying method.
You'd have to read the metadata device instead. To do this properly you
have to reserve a metadata snapshot, and there can only ever be one of
those per pool, which means you could only have one such disk in total
running on a system, and no other dm-thin-metadata-aware tool could be
used during that time (otherwise the reserve operations would fail with an
error and qemu would have to wait and retry a lot...).