Re: bitmap migration bug with -drive while block mirror runs

2019-10-05 Thread John Snow



On 10/4/19 9:07 AM, Eric Blake wrote:
> On 10/4/19 4:24 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> The way I see it, we know an auto-generated node name will never be
>>> correct, but an explicitly specified one represents an explicit user
>>> configuration.
>>>
>>> It's wrong to use generated names for migration details, but it's never
>>> wrong to use explicit configuration.
>>>
>>> So I believe they are /already/ distinct in nature. We even already have
>>> code that can tell them apart.
>>
>> Is it restricted to create user node-names formatted like automated ones?
> 
> Yes. Explicit node names cannot begin with '#', while all generated node
> names do.

Right, we already have id_wellformed which tells us which kind of node
names are which. Automatic ones are not wellformed, explicit ones are.

Peter's latest reply to my wall of text is cooling me off on the plan I
laid out in that missive, though.

> 
> 
>>> There are four cases here:
>>>
>>> - The bitmap is loaded to a root node with an explicit name
>>> - The bitmap is loaded to a non-root node with an explicit name
>>>
>>> The blockdev case with persistence. The name represents explicit user
>>> configuration and can be relied upon in the destination.
>>>
>>> - The bitmap is loaded to a root node with an implicit name, with a
>>> named BB
>>>
>>> The -drive case. The named BB represents the explicit user configuration
>>> and can be relied upon.
>>>
>>> - The bitmap is loaded to a non-root node with an implicit name.
>>
>> So, do you suggest to save information of haw bitmap was loaded or
>> created in
>> BdrvDirtyBitmap structure, to distinguish, how to identify it, by blk
>> name or
>> by node-name? And how this information would be updated on bitmap
>> merge? And
>> what about creating bitmaps?
>>
>> So if one bitmap created in node N by blk name B, and another bitmap
>> created in
>> same node N by node-name N, will we migrated these bitmaps in
>> different ways?
> 
> In the -drive case (historical libvirt), the block device is named, and
> node names are generated (it may be possible to use -drive and still
> create explicit node names, but libvirt will never do that).  You can
> create a bitmap using either ('drive-name','bitmap-name'), or
> ('generated-node-name','bitmap-name'), but for the purposes of
> migration, only the 'drive-name' variant is migrateable.
> 
> In the -blockdev case (upcoming libvirt), the block device is anonymous,
> and all node names are given by libvirt.  Thus, you can only create a
> bitmap using ('node-name','bitmap-name'), but it is also obvious that
> migration will use the 'node-name' variant.
> 
> 

 If it's a problem for libvirt to keep same node-names, why should we
 insist?


>>>
>>> Is it really a problem? If libvirt requests migration of bitmaps, those
>>> bitmaps need to go somewhere. Even if it constructs a different graph
>>> for valid reasons, it should still understand which qcow2 nodes
>>> correlate to which other qcow2 nodes and name them accordingly.
>>>
>>> I don't see why this is actually a terrible constraint. Even in our
>>> mapping proposal we're still using node-names.
>>>
>>>
> 
> The obvious case I see is that if we have a source:
> 
> Backing.qcow2 (contains bitmap1) <- Active.qcow2 (contains bitmap2)
> 
> and we want to migrate AND flatten at the same time, but still preserve
> the bitmaps as a record of changes between the points in time, then
> libvirt needs a way to specify migration to:
> 
> Flattened.qcow2 (contains bitmap1 and bitmap2)
> 
> No matter which node name libvirt assigns to Flattened.qcow2, at least
> one of the two bitmaps on the source will be migrated to a different
> node name on the destination, while still giving the net result of a
> bitmap logically associated with the drive (and not any particular node).
> 

A good example that clearly demonstrates the need for an explicit
mapping provided by libvirt.

> 
>> Ok, I'm not completely against node-name matching, keeping in mind
>> that it is
>> current default behavior anyway. And I see Peter not completely
>> against too.
>>
>> But I'd prefer to select default name from current moment, not
>> involving information
>> of "how bitmap was created or loaded", as it may lead to migrating
>> bitmaps from one
>> node in different ways which seems inconsistent.
> 
> As long as a bitmap never has both names non-generated, we should be
> fine (it either has an explicit drive name and generated node name, or
> it has no drive name and an explicit node name).
> 
>>
>> Current default is blk name. And node-name if blk name is not
>> available. So I think
>> the only thing to fix right now is skipping filters. We possibly may
>> additionally
>> restrict migrating bitmaps without blk name and with generated node-name.
>>
> 




Re: bitmap migration bug with -drive while block mirror runs

2019-10-05 Thread John Snow



On 10/4/19 4:33 AM, Peter Krempa wrote:
> On Thu, Oct 03, 2019 at 19:34:56 -0400, John Snow wrote:
>> On 10/3/19 6:14 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> 03.10.2019 0:35, John Snow wrote:
 On 10/2/19 6:46 AM, Peter Krempa wrote:
>>> 
> 
> [...]
> 
> (I'm sorry if I ignored something which might require input in the
> trimmed part but I don't have enough mental capacity to follow this
> thread fully)
> 

Yeah, understandable -- it's getting a bit long, but I'm trying to make
sure I understand the nuance everywhere before I start pursuing a
particular solution.

I think you caught the important parts in your replies below.

>>>
>>> If it's a problem for libvirt to keep same node-names, why should we insist?
>>>
>>>
>>
>> Is it really a problem? If libvirt requests migration of bitmaps, those
>> bitmaps need to go somewhere. Even if it constructs a different graph
>> for valid reasons, it should still understand which qcow2 nodes
>> correlate to which other qcow2 nodes and name them accordingly.
> 
> Well, no it is not a problem. Since bitmap migration has a migration
> capability and libvirt by default disables all unknown migration
> capabilities we can deal with it.
> 
> We have measures to transfer state to the destination we can
> basically do the equivalent of the explicit mapping but with extra
> steps.
> 
> We know where we want to place the bitmap and thus we can configure
> those nodes appropriately and generate new names for everything else so
> that nothing gets accidentally copied to wrong place.
> 
> My concern is though about the future. Since this is the first instance
> of such a  migration feature which requires node names it's okay because
> we can cheat by naming the destination "appropriately". The problem
> will start though if there will be something else bound to the backend
> of a disk addressed by node names which will have different semantics.
> 
> In that case we won't be able to cheat again.
> 

OK, I see the concern now. Though we're free to name nodes to achieve
the bitmap semantics we want right now, graph reconfigurations in the
future might not be able to fit within the same constraints simultaneously.

Reasonable concern.

Thank you for the illustrative hypothetical.

> Let's assume the following example:
> 
> qemu adds a new feature of migrating the qcow2 L2 cache. This will
> obviously have different implications on when it can be used than
> bitmaps.
> 
> If we'd like to use either of the features but not both together on a
> node there wouldn't be a possibility to achieve that.
> 
> The thing about bitmaps is that they are not really bound to the image
> itself but rather the data in the image. As long as the user provides a
> image with exactly the same contents the VM can use it and the bitmap
> will be correct for it.
> 
> We use this in non-shared storage migration where we by default flatten
> the backing chain into a new image. In such case a bitmap is still valid
> but the cache in the hypothetical example is not valid to be copied over
> for the same node name.
> 
> At the very least the nuances of the capability should be documented so
> that we don't have to second guess what is going to happen.
> 

OK, understood.

>> I don't see why this is actually a terrible constraint. Even in our
>> mapping proposal we're still using node-names.
>>
>>
>> So here's a summary of where I stand right now:
>>
>> - I want an API in QEMU that doesn't require libvirt.
>>
>> - I want to accommodate libvirt, but don't understand the requirement
>> that node-names must be ephemeral.
> 
> As I've outlined above, the node names must not be ephemeral but on the
> same note it's then necessary to clarify when they must be stable
> accross migration and when they must be different.
> 
> In the above example I'm outlining an image which has the same data but
> it's a different image (it was converted for example). In that case the
> bitmap migration would imply the same node name, but at the same time
> the image is completely different and any other feature may be
> incompatible with it.
> 
> The same is possible e.g. when you have multiple protocols to access the
> same data are they the same thing and thus warrant the same node name?
> or are they different.
> 
> Treating node names as ephemeral has the advantage of not trying to
> assume the equivalence of the images on the migration channel and not
> having to try to figure out whether they are "euqivalent enough" for the
> given feature.
> 
>>
>> - I would like to define a set of default behaviors (when bitmap
>> migration is enabled) that migrates only bitmaps it is confident it can
>> do so correctly and error out when it cannot.
> 
> This requires also defining a set of external constraints when it will
> work. Note that they can differ with other features.
> 
>>
>> - I'd like to amend the bitmap device name resolution to accommodate the
>> drive-mirror case.
>>
>> - Acknowledging that there might be cases 

Re: [PATCH 00/11] hw: Convert various reset() handler to DeviceReset

2019-10-05 Thread Michael S. Tsirkin
On Thu, Sep 26, 2019 at 05:17:22PM +0200, Philippe Mathieu-Daudé wrote:
> Hi.
> 
> Following the thread discussion between Peter/Markus/Damien about
> reset handlers:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg617103.html
> I started to remove qemu_register_reset() calls from few qdevified
> devices (the trivial ones).
> 
> Regards,
> 
> Phil.

How do you want these patches merged? Trivial tree?

> Philippe Mathieu-Daudé (11):
>   hw/acpi/piix4: Convert reset handler to DeviceReset
>   hw/ide/piix: Convert reset handler to DeviceReset
>   hw/isa/piix4: Convert reset handler to DeviceReset
>   hw/pci-host/piix: Convert reset handler to DeviceReset
>   hw/ide/sii3112: Convert reset handler to DeviceReset
>   hw/ide/via82c: Convert reset handler to DeviceReset
>   hw/isa/vt82c686: Convert reset handler to DeviceReset
>   hw/input/lm832x: Convert reset handler to DeviceReset
>   hw/pci-host/bonito: Convert reset handler to DeviceReset
>   hw/timer/etraxfs: Convert reset handler to DeviceReset
>   hw/misc/vmcoreinfo: Convert reset handler to DeviceReset
> 
>  hw/acpi/piix4.c  |  7 +++
>  hw/ide/piix.c|  8 +++-
>  hw/ide/sii3112.c |  7 +++
>  hw/ide/via.c | 10 --
>  hw/input/lm832x.c| 12 +---
>  hw/isa/piix4.c   |  7 +++
>  hw/isa/vt82c686.c| 11 ---
>  hw/misc/vmcoreinfo.c |  5 ++---
>  hw/pci-host/bonito.c |  8 +++-
>  hw/pci-host/piix.c   |  8 +++-
>  hw/timer/etraxfs_timer.c |  7 +++
>  11 files changed, 36 insertions(+), 54 deletions(-)
> 
> -- 
> 2.20.1



[PATCH 1/1] MAINTAINERS: Add Vladimir as a reviewer for bitmaps

2019-10-05 Thread John Snow
I already try to make sure all bitmaps patches have been reviewed by both
Red Hat and Virtuozzo anyway, so this formalizes the arrangement.

Fam meanwhile is no longer as active, so I am removing him as a co-maintainer
simply to reflect the current practice.

Signed-off-by: John Snow 
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 21264eae9c4..87e80ae110c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1818,8 +1818,8 @@ F: qapi/transaction.json
 T: git https://repo.or.cz/qemu/armbru.git block-next
 
 Dirty Bitmaps
-M: Fam Zheng 
 M: John Snow 
+R: Vladimir Sementsov-Ogievskiy 
 L: qemu-block@nongnu.org
 S: Supported
 F: util/hbitmap.c
@@ -1828,7 +1828,6 @@ F: include/qemu/hbitmap.h
 F: include/block/dirty-bitmap.h
 F: tests/test-hbitmap.c
 F: docs/interop/bitmaps.rst
-T: git https://github.com/famz/qemu.git bitmaps
 T: git https://github.com/jnsnow/qemu.git bitmaps
 
 Character device backends
-- 
2.21.0




[PATCH 0/1] MAINTAINERS: Add Vladimir as a reviewer for bitmaps

2019-10-05 Thread John Snow
Hi; I'll be going away on an extended trip this November and have made
arrangements for reviews to be handled in my absence. I've asked Vladimir
to take point on any reviews for patches he didn't author, and have asked
Eric to take point on reviewing any of Vladimir's patches for this tree.

Because Virtuozzo contributes so heavily to this area, I've always liked
to get Vladimir's approval on patches before merging them anyway, so
this change formalizes the existing practice.

John Snow (1):
  MAINTAINERS: Add Vladimir as a reviewer for bitmaps

 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

-- 
2.21.0




[PATCH v6 1/4] block/replication.c: Ignore requests after failover

2019-10-05 Thread Lukas Straub
After failover the Secondary side of replication shouldn't change state, because
it now functions as our primary disk.

In replication_start, replication_do_checkpoint, replication_stop, ignore
the request if current state is BLOCK_REPLICATION_DONE (sucessful failover) or
BLOCK_REPLICATION_FAILOVER (failover in progres i.e. currently merging active
and hidden images into the base image).

Signed-off-by: Lukas Straub 
Reviewed-by: Zhang Chen 
---
 block/replication.c | 38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 3d4dedddfc..97cc65c0cf 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -454,6 +454,17 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 aio_context_acquire(aio_context);
 s = bs->opaque;

+if (s->stage == BLOCK_REPLICATION_DONE ||
+s->stage == BLOCK_REPLICATION_FAILOVER) {
+/*
+ * This case happens when a secondary is promoted to primary.
+ * Ignore the request because the secondary side of replication
+ * doesn't have to do anything anymore.
+ */
+aio_context_release(aio_context);
+return;
+}
+
 if (s->stage != BLOCK_REPLICATION_NONE) {
 error_setg(errp, "Block replication is running or done");
 aio_context_release(aio_context);
@@ -529,8 +540,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
"Block device is in use by internal backup job");

 top_bs = bdrv_lookup_bs(s->top_id, s->top_id, NULL);
-if (!top_bs || !bdrv_is_root_node(top_bs) ||
-!check_top_bs(top_bs, bs)) {
+if (!top_bs || !check_top_bs(top_bs, bs)) {
 error_setg(errp, "No top_bs or it is invalid");
 reopen_backing_file(bs, false, NULL);
 aio_context_release(aio_context);
@@ -577,6 +587,17 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
 aio_context_acquire(aio_context);
 s = bs->opaque;

+if (s->stage == BLOCK_REPLICATION_DONE ||
+s->stage == BLOCK_REPLICATION_FAILOVER) {
+/*
+ * This case happens when a secondary was promoted to primary.
+ * Ignore the request because the secondary side of replication
+ * doesn't have to do anything anymore.
+ */
+aio_context_release(aio_context);
+return;
+}
+
 if (s->mode == REPLICATION_MODE_SECONDARY) {
 secondary_do_checkpoint(s, errp);
 }
@@ -593,7 +614,7 @@ static void replication_get_error(ReplicationState *rs, 
Error **errp)
 aio_context_acquire(aio_context);
 s = bs->opaque;

-if (s->stage != BLOCK_REPLICATION_RUNNING) {
+if (s->stage == BLOCK_REPLICATION_NONE) {
 error_setg(errp, "Block replication is not running");
 aio_context_release(aio_context);
 return;
@@ -635,6 +656,17 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
 aio_context_acquire(aio_context);
 s = bs->opaque;

+if (s->stage == BLOCK_REPLICATION_DONE ||
+s->stage == BLOCK_REPLICATION_FAILOVER) {
+/*
+ * This case happens when a secondary was promoted to primary.
+ * Ignore the request because the secondary side of replication
+ * doesn't have to do anything anymore.
+ */
+aio_context_release(aio_context);
+return;
+}
+
 if (s->stage != BLOCK_REPLICATION_RUNNING) {
 error_setg(errp, "Block replication is not running");
 aio_context_release(aio_context);
--
2.20.1




[PATCH v6 4/4] colo: Update Documentation for continuous replication

2019-10-05 Thread Lukas Straub
Document the qemu command-line and qmp commands for continuous replication

Signed-off-by: Lukas Straub 
---
 docs/COLO-FT.txt   | 213 +++--
 docs/block-replication.txt |  28 +++--
 2 files changed, 174 insertions(+), 67 deletions(-)

diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt
index ad24680d13..bc1a0ccb99 100644
--- a/docs/COLO-FT.txt
+++ b/docs/COLO-FT.txt
@@ -145,35 +145,65 @@ The diagram just shows the main qmp command, you can get 
the detail
 in test procedure.
 
 == Test procedure ==
-1. Startup qemu
-Primary:
-# qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -qmp stdio -name primary \
-  -device piix3-usb-uhci -vnc :7 \
-  -device usb-tablet -netdev tap,id=hn0,vhost=off \
-  -device virtio-net-pci,id=net-pci0,netdev=hn0 \
-  -drive 
if=virtio,id=primary-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
- children.0.file.filename=1.raw,\
- children.0.driver=raw -S
-Secondary:
-# qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -qmp stdio -name secondary \
-  -device piix3-usb-uhci -vnc :7 \
-  -device usb-tablet -netdev tap,id=hn0,vhost=off \
-  -device virtio-net-pci,id=net-pci0,netdev=hn0 \
-  -drive 
if=none,id=secondary-disk0,file.filename=1.raw,driver=raw,node-name=node0 \
-  -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
- file.driver=qcow2,top-id=active-disk0,\
- file.file.filename=/mnt/ramfs/active_disk.img,\
- file.backing.driver=qcow2,\
- file.backing.file.filename=/mnt/ramfs/hidden_disk.img,\
- file.backing.backing=secondary-disk0 \
-  -incoming tcp:0:
-
-2. On Secondary VM's QEMU monitor, issue command
+Note: Here we are running both instances on the same Host for testing,
+change the IP Addresses if you want to run it on two Hosts. Initally
+127.0.0.1 is the Primary Host and 127.0.0.2 is the Secondary Host.
+
+== Startup qemu ==
+1. Primary:
+Note: Initally, $imagefolder/primary.qcow2 needs to be copied to all Hosts.
+# imagefolder="/mnt/vms/colo-test-primary"
+
+# qemu-system-x86_64 -enable-kvm -cpu qemu64,+kvmclock -m 512 -smp 1 -qmp 
stdio \
+   -device piix3-usb-uhci -device usb-tablet -name primary \
+   -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \
+   -device rtl8139,id=e0,netdev=hn0 \
+   -chardev socket,id=mirror0,host=0.0.0.0,port=9003,server,nowait \
+   -chardev socket,id=compare1,host=0.0.0.0,port=9004,server,wait \
+   -chardev socket,id=compare0,host=127.0.0.1,port=9001,server,nowait \
+   -chardev socket,id=compare0-0,host=127.0.0.1,port=9001 \
+   -chardev socket,id=compare_out,host=127.0.0.1,port=9005,server,nowait \
+   -chardev socket,id=compare_out0,host=127.0.0.1,port=9005 \
+   -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0 \
+   -object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out \
+   -object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0 \
+   -object iothread,id=iothread1 \
+   -object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,\
+outdev=compare_out0,iothread=iothread1 \
+   -drive 
if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
+children.0.file.filename=$imagefolder/primary.qcow2,children.0.driver=qcow2 -S
+
+2. Secondary:
+# imagefolder="/mnt/vms/colo-test-secondary"
+# primary_ip=127.0.0.1
+
+# qemu-img create -f qcow2 $imagefolder/secondary-active.qcow2 10G
+
+# qemu-img create -f qcow2 $imagefolder/secondary-hidden.qcow2 10G
+
+# qemu-system-x86_64 -enable-kvm -cpu qemu64,+kvmclock -m 512 -smp 1 -qmp 
stdio \
+   -device piix3-usb-uhci -device usb-tablet -name secondary \
+   -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \
+   -device rtl8139,id=e0,netdev=hn0 \
+   -chardev socket,id=red0,host=$primary_ip,port=9003,reconnect=1 \
+   -chardev socket,id=red1,host=$primary_ip,port=9004,reconnect=1 \
+   -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0 \
+   -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1 \
+   -object filter-rewriter,id=rew0,netdev=hn0,queue=all \
+   -drive 
if=none,id=parent0,file.filename=$imagefolder/primary.qcow2,driver=qcow2 \
+   -drive 
if=none,id=childs0,driver=replication,mode=secondary,file.driver=qcow2,\
+top-id=childs0,file.file.filename=$imagefolder/secondary-active.qcow2,\
+file.backing.driver=qcow2,file.backing.file.filename=$imagefolder/secondary-hidden.qcow2,\
+file.backing.backing=parent0 \
+   -drive 
if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
+children.0=childs0 \
+   -incoming tcp:0.0.0.0:9998
+
+
+3. On Secondary VM's QEMU monitor, issue command
 {'execute':'qmp_capabilities'}
-{ 'execute': 'nbd-server-start',
-  'arguments': {'addr': {'type': 'inet', 'data': {'host': 'xx.xx.xx.xx', 
'port': '8889'} } }
-}
-{'execute': 'nbd-server-add', 'arguments': {'device': 'secondary-disk0', 
'writable': true } }
+{'execute': 'nbd-server-start', 'arguments': {'addr': {'type': 'inet', 'data': 

[PATCH v6 3/4] net/filter.c: Add Options to insert filters anywhere in the filter list

2019-10-05 Thread Lukas Straub
To switch the Secondary to Primary, we need to insert new filters
before the filter-rewriter.

Add the options insert= and position= to be able to insert filters
anywhere in the filter list.

position should be "head" or "tail" to insert at the head or
tail of the filter list or it should be "id=" to specify
the id of another filter.
insert should be either "before" or "behind" to specify where to
insert the new filter relative to the one specified with position.

Signed-off-by: Lukas Straub 
Reviewed-by: Zhang Chen 
---
 include/net/filter.h |  2 +
 net/filter.c | 92 +++-
 qemu-options.hx  | 31 ---
 3 files changed, 119 insertions(+), 6 deletions(-)

diff --git a/include/net/filter.h b/include/net/filter.h
index 49da666ac0..22a723305b 100644
--- a/include/net/filter.h
+++ b/include/net/filter.h
@@ -62,6 +62,8 @@ struct NetFilterState {
 NetClientState *netdev;
 NetFilterDirection direction;
 bool on;
+char *position;
+bool insert_before_flag;
 QTAILQ_ENTRY(NetFilterState) next;
 };

diff --git a/net/filter.c b/net/filter.c
index 28d1930db7..cd2ef9e979 100644
--- a/net/filter.c
+++ b/net/filter.c
@@ -171,11 +171,47 @@ static void netfilter_set_status(Object *obj, const char 
*str, Error **errp)
 }
 }

+static char *netfilter_get_position(Object *obj, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+return g_strdup(nf->position);
+}
+
+static void netfilter_set_position(Object *obj, const char *str, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+nf->position = g_strdup(str);
+}
+
+static char *netfilter_get_insert(Object *obj, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+return nf->insert_before_flag ? g_strdup("before") : g_strdup("behind");
+}
+
+static void netfilter_set_insert(Object *obj, const char *str, Error **errp)
+{
+NetFilterState *nf = NETFILTER(obj);
+
+if (strcmp(str, "before") && strcmp(str, "behind")) {
+error_setg(errp, "Invalid value for netfilter insert, "
+ "should be 'before' or 'behind'");
+return;
+}
+
+nf->insert_before_flag = !strcmp(str, "before");
+}
+
 static void netfilter_init(Object *obj)
 {
 NetFilterState *nf = NETFILTER(obj);

 nf->on = true;
+nf->insert_before_flag = false;
+nf->position = g_strdup("tail");

 object_property_add_str(obj, "netdev",
 netfilter_get_netdev_id, netfilter_set_netdev_id,
@@ -187,11 +223,18 @@ static void netfilter_init(Object *obj)
 object_property_add_str(obj, "status",
 netfilter_get_status, netfilter_set_status,
 NULL);
+object_property_add_str(obj, "position",
+netfilter_get_position, netfilter_set_position,
+NULL);
+object_property_add_str(obj, "insert",
+netfilter_get_insert, netfilter_set_insert,
+NULL);
 }

 static void netfilter_complete(UserCreatable *uc, Error **errp)
 {
 NetFilterState *nf = NETFILTER(uc);
+NetFilterState *position = NULL;
 NetClientState *ncs[MAX_QUEUE_NUM];
 NetFilterClass *nfc = NETFILTER_GET_CLASS(uc);
 int queues;
@@ -219,6 +262,41 @@ static void netfilter_complete(UserCreatable *uc, Error 
**errp)
 return;
 }

+if (strcmp(nf->position, "head") && strcmp(nf->position, "tail")) {
+Object *container;
+Object *obj;
+char *position_id;
+
+if (!g_str_has_prefix(nf->position, "id=")) {
+error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "position",
+   "'head', 'tail' or 'id='");
+return;
+}
+
+/* get the id from the string */
+position_id = g_strndup(nf->position + 3, strlen(nf->position) - 3);
+
+/* Search for the position to insert before/behind */
+container = object_get_objects_root();
+obj = object_resolve_path_component(container, position_id);
+if (!obj) {
+error_setg(errp, "filter '%s' not found", position_id);
+g_free(position_id);
+return;
+}
+
+position = NETFILTER(obj);
+
+if (position->netdev != ncs[0]) {
+error_setg(errp, "filter '%s' belongs to a different netdev",
+position_id);
+g_free(position_id);
+return;
+}
+
+g_free(position_id);
+}
+
 nf->netdev = ncs[0];

 if (nfc->setup) {
@@ -228,7 +306,18 @@ static void netfilter_complete(UserCreatable *uc, Error 
**errp)
 return;
 }
 }
-QTAILQ_INSERT_TAIL(>netdev->filters, nf, next);
+
+if (position) {
+if (nf->insert_before_flag) {
+QTAILQ_INSERT_BEFORE(position, nf, next);
+} else {
+QTAILQ_INSERT_AFTER(>netdev->filters, position, nf, next);
+}

[PATCH v6 0/4] colo: Add support for continuous replication

2019-10-05 Thread Lukas Straub
Hello Everyone,
These Patches add support for continuous replication to colo. This means
that after the Primary fails and the Secondary did a failover, the Secondary
can then become Primary and resume replication to a new Secondary.

Regards,
Lukas Straub

v6:
 - properly documented the position= and insert= options
 - renamed replication test
 - clarified documentation by using different ip's for primary and secondary
 - added Reviewed-by tags

v5:
 - change syntax for the position= parameter
 - fix spelling mistake

v4:
 - fix checkpatch.pl warnings

v3:
 - add test for replication changes
 - check if the filter to be inserted before/behind belongs to the same 
interface
 - fix the error message for the position= parameter
 - rename term "after" -> "behind" and variable "insert_before" -> 
"insert_before_flag"
 - document the quorum node on the secondary side
 - simplify quorum parameters in documentation
 - remove trailing spaces in documentation
 - clarify the testing procedure in documentation

v2:
 - fix email formating
 - fix checkpatch.pl warnings
 - fix patchew error
 - clearer commit messages


Lukas Straub (4):
  block/replication.c: Ignore requests after failover
  tests/test-replication.c: Add test for for secondary node continuing
replication
  net/filter.c: Add Options to insert filters anywhere in the filter
list
  colo: Update Documentation for continuous replication

 block/replication.c|  38 ++-
 docs/COLO-FT.txt   | 213 +++--
 docs/block-replication.txt |  28 +++--
 include/net/filter.h   |   2 +
 net/filter.c   |  92 +++-
 qemu-options.hx|  31 +-
 tests/test-replication.c   |  52 +
 7 files changed, 380 insertions(+), 76 deletions(-)

--
2.20.1



[PATCH v6 2/4] tests/test-replication.c: Add test for for secondary node continuing replication

2019-10-05 Thread Lukas Straub
This simulates the case that happens when we resume COLO after failover.

Signed-off-by: Lukas Straub 
---
 tests/test-replication.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/tests/test-replication.c b/tests/test-replication.c
index f085d1993a..8e18ecd998 100644
--- a/tests/test-replication.c
+++ b/tests/test-replication.c
@@ -489,6 +489,56 @@ static void test_secondary_stop(void)
 teardown_secondary();
 }

+static void test_secondary_continuous_replication(void)
+{
+BlockBackend *top_blk, *local_blk;
+Error *local_err = NULL;
+
+top_blk = start_secondary();
+replication_start_all(REPLICATION_MODE_SECONDARY, _err);
+g_assert(!local_err);
+
+/* write 0x22 to s_local_disk (IMG_SIZE / 2, IMG_SIZE) */
+local_blk = blk_by_name(S_LOCAL_DISK_ID);
+test_blk_write(local_blk, 0x22, IMG_SIZE / 2, IMG_SIZE / 2, false);
+
+/* replication will backup s_local_disk to s_hidden_disk */
+test_blk_read(top_blk, 0x11, IMG_SIZE / 2,
+  IMG_SIZE / 2, 0, IMG_SIZE, false);
+
+/* write 0x33 to s_active_disk (0, IMG_SIZE / 2) */
+test_blk_write(top_blk, 0x33, 0, IMG_SIZE / 2, false);
+
+/* do failover (active commit) */
+replication_stop_all(true, _err);
+g_assert(!local_err);
+
+/* it should ignore all requests from now on */
+
+/* start after failover */
+replication_start_all(REPLICATION_MODE_PRIMARY, _err);
+g_assert(!local_err);
+
+/* checkpoint */
+replication_do_checkpoint_all(_err);
+g_assert(!local_err);
+
+/* stop */
+replication_stop_all(true, _err);
+g_assert(!local_err);
+
+/* read from s_local_disk (0, IMG_SIZE / 2) */
+test_blk_read(top_blk, 0x33, 0, IMG_SIZE / 2,
+  0, IMG_SIZE / 2, false);
+
+
+/* read from s_local_disk (IMG_SIZE / 2, IMG_SIZE) */
+test_blk_read(top_blk, 0x22, IMG_SIZE / 2,
+  IMG_SIZE / 2, 0, IMG_SIZE, false);
+
+teardown_secondary();
+}
+
 static void test_secondary_do_checkpoint(void)
 {
 BlockBackend *top_blk, *local_blk;
@@ -584,6 +634,8 @@ int main(int argc, char **argv)
 g_test_add_func("/replication/secondary/write", test_secondary_write);
 g_test_add_func("/replication/secondary/start", test_secondary_start);
 g_test_add_func("/replication/secondary/stop",  test_secondary_stop);
+g_test_add_func("/replication/secondary/continuous_replication",
+test_secondary_continuous_replication);
 g_test_add_func("/replication/secondary/do_checkpoint",
 test_secondary_do_checkpoint);
 g_test_add_func("/replication/secondary/get_error_all",
--
2.20.1