Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

2019-10-26 Thread Alberto Garcia
On Wed 23 Oct 2019 12:39:14 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> Hi!
>
> This is very interesting! Could you please export a branch to look at,
> as patches can't be applied on master now :(

I just sent a new version with some updates and rebased on top of the
current master.

Berto



Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

2019-10-23 Thread Vladimir Sementsov-Ogievskiy
Hi!

This is very interesting! Could you please export a branch to look at,
as patches can't be applied on master now :(

15.10.2019 18:23, Alberto Garcia wrote:
> Hi,
> 
> this series adds a new feature to the qcow2 on-disk format called
> "Extended L2 Entries", which allows us to do subcluster allocation.
> 
> This cover letter explains the reasons behind this proposal, the
> changes to the on-disk format, test results and pending work. If you
> are curious you can also have a look at previous discussions about
> this feature:
> 
> https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
> https://lists.gnu.org/archive/html/qemu-block/2019-06/msg01155.html
> 
> This is the first proper version of the patches, and I believe that
> the implementation is complete. However since I'm proposing a change
> to the on-disk format I'm labeling this as RFC because I'm expecting
> some debate. I'll remove the RFC tag and add more tests in future
> revisions.
> 
> === Problem ===
> 
> A qcow2 image is divided into units of constant size called clusters,
> and among other things it contains metadata that maps guest addresses
> to host addresses (the so-called L1 and L2 tables).
> 
> There are two basic problems that result from this:
> 
> 1) Reading from or writing to a qcow2 image involves reading the
> corresponding entry on the L2 table that maps the guest address to
> the host address. This is very slow because it involves two I/O
> operations: one on the L2 table and the other one on the actual
> data cluster.
> 
> 2) A cluster is the smallest unit of allocation. Therefore writing a
> mere 512 bytes to an empty disk requires allocating a complete
> cluster and filling it with zeroes (or with data from the backing
> image if there is one). This wastes more disk space and also has a
> negative impact on I/O.
> 
> Problem (1) can be solved by caching the L2 tables in memory. The
> maximum amount of disk space used by L2 tables depends on the virtual
> disk size and the cluster size:
> 
> max_l2_size = virtual_disk_size * 8 / cluster_size
> 
> Because of this, the only way to reduce the size of the L2 tables is
> by increasing the cluster size (which can be any power of two between
> 512 bytes and 2 MB). But then we hit problem (2): I/O is slower and
> more disk space is wasted.
> 
> === The proposal ===
> 
> The proposal is to extend the qcow2 format by allowing subcluster
> allocation. The on-disk format remains essentially the same, except
> that each data cluster is internally divided into 32 subclusters of
> equal size.
> 
> The way it works in practice is with a new optional feature called
> "Extended L2 Entries", that needs to be enabled when an image is
> created. With this, each entry on an L2 table is accompanied by a
> bitmap indicating the allocation state of each one of the subclusters
> for that cluster. The size of an L2 entry doubles from 64 to 128 bits.
> 
> Other than L2 entries, all other data structures remain unchanged, but
> for data clusters the smallest unit of allocation is now the
> subcluster. Reference counting is still at the cluster level, because
> there is no way to reference individual subclusters. Copy-on-write on
> internal snapshots needs to copy complete clusters, so that scenario
> would not benefit from this change.
> 
> I see two main use cases for this feature:
> 
> a) The qcow2 image is not too large / the L2 cache is not a problem,
> but you want to increase the allocation performance. In this case
> you can have a 128KB cluster with 4KB subclusters (with 4KB being a
> common block size in ext4 and other filesystems)
> 
> b) The qcow2 image is very large and you want to save metadata space
> in order to have a smaller L2 cache. In this case you can go for
> the maximum cluster size (2MB) but you want to have smaller
> subclusters to increase the allocation performance and optimize the
> disk usage.
> 
> === Changes to the on-disk format ===
> 
> An L2 entry is 64 bits wide, with this format (for uncompressed
> clusters):
> 
> 6356 5548 4740 3932 3124 2316 15 8 7  0
>        
> **<> <--><--->*
>Rsrved  host cluster offset of data Reserved
>(6 bits)(47 bits)   (8 bits)
> 
>  bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>  bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>  bit  0: all zeros   (QCOW_OFLAG_ZERO)
> 
> If Extended L2 Entries are enabled, bit 0 becomes reserved and must be
> unset, and this 64-bit bitmap follows the entry:
> 
> 6356 5548 4740 3932 3124 2316 15 8 7  0
>        
> <-> <->
> 

Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

2019-10-16 Thread Alberto Garcia
On Tue 15 Oct 2019 06:05:23 PM CEST, Eric Blake wrote:

>> 6356 5548 4740 3932 3124 2316 15 8 7  0
>>        
>> <-> <->
>>   subcluster reads as zerossubcluster is allocated
>>   (32 bits)   (32 bits)
>
> okay, in patch 5, you said you map the most significant bit to the
> first cluster. That feels backwards to me; I wonder if the math is any
> easier if you map sub-clusters starting from the least-significant,
> because then you get:
>
> bit = (address >> cluster_size) & 32
>
> rather than
>
> bit = 31 - ((address >> cluster_size) & 32)

The reason why I chose that ordering is because I think it's more
natural for debugging if you read from left to right:

6356 5548 4740 3932 3124 2316 15 8 7  0
   0001 1110   
<-> <->
  subcluster reads as zerossubcluster is allocated

Here the last five subclusters read as zeros, and the first three
subclusters are allocated.

I don't think the math is any different. What you need in the code is

  1) A way to get the subcluster index. That doesn't change, it's

sc_index = (address >> cluster_bits) & 31

 in both cases.

  2) A way to get the "subcluster reads as zeros" and "subcluster is
 allocated" masks. That's not very different either, it's a constant
 shifted by the subcluster index in both cases:

 LSB first:

all_zeros_mask = (1 << 32) << sc_index
allocated_mask = 1 << sc_index

 MSB first:

all_zeros_mask = (1 << 63) >> sc_index
allocated_mask = (1 << 31) >> sc_index

>> Some comments about the results:
>> 
>> - The smallest allowed cluster size for an image with subclusters is
>>16 KB (in this case the subclusters size is 512 bytes), hence the
>>missing values in the 4 KB and 8 KB rows.
>
> Again reading ahead, I see that patch 5 requires a 16k minimum cluster 
> for using extended L2.  Could we still permit clusters smaller than 
> that, but merely document that subclusters are always a minimum of 512 
> bytes and therefore for an 8k cluster we only use 16 bits (leaving the 
> other 16 bits zero)?  But I'm also fine with the simplicity of just 
> stating that subclusters require at least 16k clusters.

I can't think of any reason why you would want smaller clusters, the
numbers show that the performance starts to drop with sizes under 16KB.

> How do subclusters interact with external data files?

As far as I'm aware they work just fine (I'll add tests for that
anyway).

Berto



Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

2019-10-15 Thread Eric Blake

On 10/15/19 10:23 AM, Alberto Garcia wrote:

Hi,

this series adds a new feature to the qcow2 on-disk format called
"Extended L2 Entries", which allows us to do subcluster allocation.

This cover letter explains the reasons behind this proposal, the
changes to the on-disk format, test results and pending work. If you
are curious you can also have a look at previous discussions about
this feature:




=== Changes to the on-disk format ===

An L2 entry is 64 bits wide, with this format (for uncompressed
clusters):

6356 5548 4740 3932 3124 2316 15 8 7  0
       
**<> <--><--->*
   Rsrved  host cluster offset of data Reserved
   (6 bits)(47 bits)   (8 bits)

 bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
 bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
 bit  0: all zeros   (QCOW_OFLAG_ZERO)

If Extended L2 Entries are enabled, bit 0 becomes reserved and must be
unset, and this 64-bit bitmap follows the entry:

6356 5548 4740 3932 3124 2316 15 8 7  0
       
<-> <->
  subcluster reads as zerossubcluster is allocated
  (32 bits)   (32 bits)


I like the grouping - you can then do a 4-byte read and comparison to 0 
to see if the entire cluster reads as zeroes or is unallocated.


With 32k clusters, this results in 1k subclusters.  In cluster 1 (offset 
32k), which bits map where?  (The obvious choices are that sub-cluster 
32k maps to bit 0, 33k maps to bit 1, ...; or that sub-cluster 32k maps 
to bit 31, 33k maps to bit 30, ...)


/me reads ahead

okay, in patch 5, you said you map the most significant bit to the first 
cluster. That feels backwards to me; I wonder if the math is any easier 
if you map sub-clusters starting from the least-significant, because 
then you get:


bit = (address >> cluster_size) & 32

rather than

bit = 31 - ((address >> cluster_size) & 32)



Some comments about the results:

- The smallest allowed cluster size for an image with subclusters is
   16 KB (in this case the subclusters size is 512 bytes), hence the
   missing values in the 4 KB and 8 KB rows.


Again reading ahead, I see that patch 5 requires a 16k minimum cluster 
for using extended L2.  Could we still permit clusters smaller than 
that, but merely document that subclusters are always a minimum of 512 
bytes and therefore for an 8k cluster we only use 16 bits (leaving the 
other 16 bits zero)?  But I'm also fine with the simplicity of just 
stating that subclusters require at least 16k clusters.




=== To do ===

A couple of things are missing from this series:

- The ability to efficiently zero individual subclusters using
   qcow2_co_pwrite_zeroes(). At the moment only full clusters can be
   zeroed with this method.

- Alternatively we could get rid of the individual "all zeroes" bits
   altogether and have 64 subclusters per cluster. We would still have
   the QCOW_OFLAG_ZERO bit in the standard cluster descriptor.


I think you've got more flexibility with the two bits per sub-cluster 
than you would with just 1 bit and 64 subclusters, so I don't think this 
direction is going to get us far.




- The number of subclusters per cluster is always 32. It would be
   trivial to allow configuring this, but I don't see any use case.


Agreed.



- Tests: I have a few written that I'll add in future revisions of
   this series.

- handle_alloc_space() works at the subclusters level. That is, if you
   have an unallocated 2MB cluster with 64KB subclusters, no backing
   image and you write 4KB of data, QEMU won't write zeroes to the
   affected subcluster(s) and will use handle_alloc_space() instead.
   The other subclusters won't be touched and will remain unallocated.
   This behavior is consistent with how subclusters work and saves disk
   space, but offers slightly lower performance (see test results
   above). Theoretically we could offer a setting to configure this,
   but I'm not convinced that this is very useful.

===

As usual, feedback is welcome,


Looks promising!

How do subclusters interact with external data files?

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org