Re: [Qemu-devel] how to judge machine's state in user-level/kernel-level
On Wed, Feb 23, 2011 at 4:05 AM, Jianhui Yue jianhui@gmail.com wrote: hi, All, I am wondering how to determine machine's state in user-level or kernel level from Operating System perspective. what's meaning of CPUState.exception_is_int? Thanks. If kernel level from Operating System perspective means privileged then you could look at the current privilege level in a target architecture-specific way. So on x86 look at the code segment selector. Stefan
Re: [Qemu-devel] [PATCH 03/22] migration: Fold MigrationState into FdMigrationState
2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration-exec.c | 10 +- migration-fd.c | 10 +- migration-tcp.c | 10 +- migration-unix.c | 10 +- migration.c | 11 +-- migration.h | 23 +-- 6 files changed, 30 insertions(+), 44 deletions(-) diff --git a/migration-exec.c b/migration-exec.c index b49a475..02b0667 100644 --- a/migration-exec.c +++ b/migration-exec.c @@ -93,12 +93,12 @@ FdMigrationState *exec_start_outgoing_migration(Monitor *mon, s-close = exec_close; s-get_error = file_errno; s-write = file_write; - s-mig_state.cancel = migrate_fd_cancel; - s-mig_state.get_status = migrate_fd_get_status; - s-mig_state.release = migrate_fd_release; + s-cancel = migrate_fd_cancel; + s-get_status = migrate_fd_get_status; + s-release = migrate_fd_release; - s-mig_state.blk = blk; - s-mig_state.shared = inc; + s-blk = blk; + s-shared = inc; s-state = MIG_STATE_ACTIVE; s-mon = NULL; diff --git a/migration-fd.c b/migration-fd.c index bd5e8a9..ccba86b 100644 --- a/migration-fd.c +++ b/migration-fd.c @@ -76,12 +76,12 @@ FdMigrationState *fd_start_outgoing_migration(Monitor *mon, s-get_error = fd_errno; s-write = fd_write; s-close = fd_close; - s-mig_state.cancel = migrate_fd_cancel; - s-mig_state.get_status = migrate_fd_get_status; - s-mig_state.release = migrate_fd_release; + s-cancel = migrate_fd_cancel; + s-get_status = migrate_fd_get_status; + s-release = migrate_fd_release; - s-mig_state.blk = blk; - s-mig_state.shared = inc; + s-blk = blk; + s-shared = inc; s-state = MIG_STATE_ACTIVE; s-mon = NULL; diff --git a/migration-tcp.c b/migration-tcp.c index 355bc37..02b01ed 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -95,12 +95,12 @@ FdMigrationState *tcp_start_outgoing_migration(Monitor *mon, s-get_error = socket_errno; s-write = socket_write; s-close = tcp_close; - s-mig_state.cancel = migrate_fd_cancel; - s-mig_state.get_status = migrate_fd_get_status; - s-mig_state.release = migrate_fd_release; + s-cancel = migrate_fd_cancel; + s-get_status = migrate_fd_get_status; + s-release = migrate_fd_release; - s-mig_state.blk = blk; - s-mig_state.shared = inc; + s-blk = blk; + s-shared = inc; s-state = MIG_STATE_ACTIVE; s-mon = NULL; diff --git a/migration-unix.c b/migration-unix.c index b9b0dbf..fb73f46 100644 --- a/migration-unix.c +++ b/migration-unix.c @@ -94,12 +94,12 @@ FdMigrationState *unix_start_outgoing_migration(Monitor *mon, s-get_error = unix_errno; s-write = unix_write; s-close = unix_close; - s-mig_state.cancel = migrate_fd_cancel; - s-mig_state.get_status = migrate_fd_get_status; - s-mig_state.release = migrate_fd_release; + s-cancel = migrate_fd_cancel; + s-get_status = migrate_fd_get_status; + s-release = migrate_fd_release; - s-mig_state.blk = blk; - s-mig_state.shared = inc; + s-blk = blk; + s-shared = inc; s-state = MIG_STATE_ACTIVE; s-mon = NULL; diff --git a/migration.c b/migration.c index 3a371a3..dd4cdab 100644 --- a/migration.c +++ b/migration.c @@ -86,7 +86,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) const char *uri = qdict_get_str(qdict, uri); if (current_migration - current_migration-mig_state.get_status(current_migration) == MIG_STATE_ACTIVE) { + current_migration-get_status(current_migration) == MIG_STATE_ACTIVE) { monitor_printf(mon, migration already in progress\n); return -1; } @@ -120,7 +120,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) } if (current_migration) { - current_migration-mig_state.release(current_migration); + current_migration-release(current_migration); } current_migration = s; @@ -133,7 +133,7 @@ int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) FdMigrationState *s = current_migration; if (s) - s-mig_state.cancel(s); + s-cancel(s); return 0; } @@ -229,7 +229,7 @@ void do_info_migrate(Monitor *mon, QObject **ret_data) QDict *qdict; if (current_migration) { - MigrationState *s = current_migration-mig_state; + FdMigrationState *s = current_migration; switch (s-get_status(current_migration)) { case MIG_STATE_ACTIVE: @@ -353,8 +353,7 @@ void migrate_fd_connect(FdMigrationState *s) migrate_fd_close); DPRINTF(beginning savevm\n); - ret = qemu_savevm_state_begin(s-mon, s-file, s-mig_state.blk, - s-mig_state.shared); + ret = qemu_savevm_state_begin(s-mon, s-file, s-blk, s-shared);
Re: [Qemu-devel] [PATCH 17/22] migration: use global variable directly
2011/2/23 Juan Quintela quint...@redhat.com: We are setting a pointer to a local variable in the previous line, just use the global variable directly. We remove the -file test because it is already done inside qemu_file_set_rate_limit() function. Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 6 ++ 1 files changed, 2 insertions(+), 4 deletions(-) diff --git a/migration.c b/migration.c index d7dfe1e..accc6e4 100644 --- a/migration.c +++ b/migration.c @@ -451,7 +451,6 @@ int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) int do_migrate_set_speed(Monitor *mon, const QDict *qdict, QObject **ret_data) { int64_t d; - MigrationState *s; d = qdict_get_int(qdict, value); if (d 0) { @@ -459,9 +458,8 @@ int do_migrate_set_speed(Monitor *mon, const QDict *qdict, QObject **ret_data) } max_throttle = d; - s = current_migration; - if (s s-file) { - qemu_file_set_rate_limit(s-file, max_throttle); + if (current_migration) { + qemu_file_set_rate_limit(current_migration-file, max_throttle); } return 0; Looks good to me. Yoshi -- 1.7.4
Re: [Qemu-devel] Re: QEMU regression problems - Update FPU
On 18 February 2011 07:12, Gerhard Wiesinger li...@wiesinger.com wrote: Issue 1.) with FPU still present I tracked down the problematic code and it is a rounding error from double precision to 64bit floats: Any ideas how to fix such an issue in general? QEMU result in ST0: 0.42925860786976457 (wrong emulated) KVM result in ST0: 0.42925860786975449 (correct) This is an error when running QEMU in TCG mode, right? At the moment x86 is the odd-one-out in that it doesn't use CONFIG_SOFTFLOAT for its FPU emulation, so somebody has made an explicit choice of preferring speed over accuracy, and I am unsurprised that there are rounding errors as a result. -- PMM
Re: [Qemu-devel] [PATCH 12/22] migration: Use migrate_fd_error() in last place that set status to ERROR
2011/2/23 Juan Quintela quint...@redhat.com: We are also calling to migrate_fd_cleanup(), but notice that it is the right thing to do. Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 6 +- 1 files changed, 1 insertions(+), 5 deletions(-) diff --git a/migration.c b/migration.c index ab98664..3983257 100644 --- a/migration.c +++ b/migration.c @@ -351,11 +351,7 @@ static ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size if (ret == -EAGAIN) { qemu_set_fd_handler2(s-fd, NULL, NULL, migrate_fd_put_notify, s); } else if (ret 0) { - if (s-mon) { - monitor_resume(s-mon); - } - s-state = MIG_STATE_ERROR; - notifier_list_notify(migration_state_notifiers); + migrate_fd_error(s); } Are you sure about this? migrate_fd_error may call qemu_fclose through migrate_fd_cleanup, but the caller of migrate_fd_put_buffer gets called by buffered_file that sits under qemu file. In my previous posting, http://permalink.gmane.org/gmane.comp.emulators.qemu/94688 I thought migrate_fd_put_buffer should just return error, and let the original caller (migrate_fd_put_notify or any) to actually call migrate_fd_error. Thanks, Yoshi return ret; -- 1.7.4
Re: [Qemu-devel] [PATCH] `qdev_free` when unplug a pci device
Isaku Yamahata yamah...@valinux.co.jp writes: On Tue, Feb 22, 2011 at 06:36:20PM +0100, William Dauchy wrote: `qdev_free` when unplug a pci device It resolves a bug when detaching/attaching a network device # virsh detach-interface dom0 network --mac 52:54:00:f6:84:ba Interface detached successfully # virsh attach-interface dom0 network default --mac 52:54:00:f6:84:ba error: Failed to attach interface error: internal error unable to execute QEMU command 'device_add': Duplicate ID 'net0' for device A detached pci device was not freed of the list `qemu_device_opts` --- hw/pci.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 8b76cea..1e9a8f0 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1671,13 +1671,16 @@ static int pci_unplug_device(DeviceState *qdev) { PCIDevice *dev = DO_UPCAST(PCIDevice, qdev, qdev); PCIDeviceInfo *info = container_of(qdev-info, PCIDeviceInfo, qdev); +int unplug; if (info-no_hotplug) { qerror_report(QERR_DEVICE_NO_HOTPLUG, info-qdev.name); return -1; } -return dev-bus-hotplug(dev-bus-hotplug_qdev, dev, +unplug = dev-bus-hotplug(dev-bus-hotplug_qdev, dev, PCI_HOTPLUG_DISABLED); +qdev_free(qdev); Good catch. qdev_free() should be called only when unplug had succeeded. Although piix4 unplug always successes, pcie unplug may fail. +return unplug; } void pci_qdev_register(PCIDeviceInfo *info) I don't think this patch is correct. Let me explain. Device hot unplug is *not* guaranteed to succeed. For some buses, such as USB, it always succeeds immediately, i.e. when the device_del monitor command finishes, the device is gone. Live is good. But for PCI, device_del merely initiates the ACPI unplug rain dance. It doesn't wait for the dance to complete. Why? The dance can take an unpredictable amount of time, including forever. Problem: Subsequent device_add can fail if it reuses the qdev ID or PCI slot, and the unplug has not yet completed (race condition), or it failed. Yes, Virginia, PCI hotplug *can* fail. When unplug succeeds, the qdev is automatically destroyed. pciej_write() does that for PIIX4. Looks like pcie_cap_slot_event() does it for PCIE. Your patch adds a *second* qdev_free(). No good.
Re: [Qemu-devel] [PATCH 08/22] migration: Check that migration is active before cancel it
2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/migration.c b/migration.c index 397a0b9..55f58c8 100644 --- a/migration.c +++ b/migration.c @@ -138,7 +138,7 @@ int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) { MigrationState *s = current_migration; - if (s) + if (s s-get_status(s) == MIG_STATE_ACTIVE) s-cancel(s); return 0; Why don't you remove *s again? Yoshi -- 1.7.4
Re: [Qemu-devel] [PATCH 22/22] migration: Make state definitions local
2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 6 ++ migration.h | 6 -- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/migration.c b/migration.c index 383ebaf..90fc2a0 100644 --- a/migration.c +++ b/migration.c @@ -31,6 +31,12 @@ do { } while (0) #endif +#define MIG_STATE_ERROR -1 +#define MIG_STATE_NONE 0 +#define MIG_STATE_CANCELLED 1 +#define MIG_STATE_ACTIVE 2 +#define MIG_STATE_COMPLETED 3 + static MigrationState current_migration = { .state = MIG_STATE_NONE, /* Migration speed throttling */ diff --git a/migration.h b/migration.h index 9457807..493fbe5 100644 --- a/migration.h +++ b/migration.h @@ -18,12 +18,6 @@ #include qemu-common.h #include notify.h -#define MIG_STATE_ERROR -1 -#define MIG_STATE_NONE 0 -#define MIG_STATE_CANCELLED 1 -#define MIG_STATE_ACTIVE 2 -#define MIG_STATE_COMPLETED 3 - Although you're right, I would prefer to keep it so that somebody outside of migration may understand the status in the future if there are no harms. Yoshi typedef struct MigrationState MigrationState; struct MigrationState -- 1.7.4
[Qemu-devel] [PATCH] reset nic when we init it
When we hotplug a nic to guest OS, we can see it but the mac address of nic is wrong. The reason of this bug is guest OS does not reset nic, and we init eeprom(rtl8139) only when we reset nic. I think we should reset nic when we init it. Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- hw/e1000.c | 14 -- hw/pcnet-pci.c | 19 +++ hw/rtl8139.c |1 + 3 files changed, 20 insertions(+), 14 deletions(-) diff --git a/hw/e1000.c b/hw/e1000.c index 0a4574c..5bd91a8 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -1137,6 +1137,12 @@ static NetClientInfo net_e1000_info = { .link_status_changed = e1000_set_link_status, }; +static void qdev_e1000_reset(DeviceState *dev) +{ +E1000State *d = DO_UPCAST(E1000State, dev.qdev, dev); +e1000_reset(d); +} + static int pci_e1000_init(PCIDevice *pci_dev) { E1000State *d = DO_UPCAST(E1000State, dev, pci_dev); @@ -1186,13 +1192,9 @@ static int pci_e1000_init(PCIDevice *pci_dev) add_boot_device_path(d-conf.bootindex, pci_dev-qdev, /ethernet-phy@0); -return 0; -} +qdev_e1000_reset(pci_dev-qdev); -static void qdev_e1000_reset(DeviceState *dev) -{ -E1000State *d = DO_UPCAST(E1000State, dev.qdev, dev); -e1000_reset(d); +return 0; } static PCIDeviceInfo e1000_info = { diff --git a/hw/pcnet-pci.c b/hw/pcnet-pci.c index 339a401..205f2c7 100644 --- a/hw/pcnet-pci.c +++ b/hw/pcnet-pci.c @@ -265,11 +265,19 @@ static NetClientInfo net_pci_pcnet_info = { .cleanup = pci_pcnet_cleanup, }; +static void pci_reset(DeviceState *dev) +{ +PCIPCNetState *d = DO_UPCAST(PCIPCNetState, pci_dev.qdev, dev); + +pcnet_h_reset(d-state); +} + static int pci_pcnet_init(PCIDevice *pci_dev) { PCIPCNetState *d = DO_UPCAST(PCIPCNetState, pci_dev, pci_dev); PCNetState *s = d-state; uint8_t *pci_conf; +int ret; #if 0 printf(sizeof(RMD)=%d, sizeof(TMD)=%d\n, @@ -315,14 +323,9 @@ static int pci_pcnet_init(PCIDevice *pci_dev) } } -return pcnet_common_init(pci_dev-qdev, s, net_pci_pcnet_info); -} - -static void pci_reset(DeviceState *dev) -{ -PCIPCNetState *d = DO_UPCAST(PCIPCNetState, pci_dev.qdev, dev); - -pcnet_h_reset(d-state); +ret = pcnet_common_init(pci_dev-qdev, s, net_pci_pcnet_info); +pci_reset(pci_dev-qdev); +return ret; } static PCIDeviceInfo pcnet_info = { diff --git a/hw/rtl8139.c b/hw/rtl8139.c index a22530c..15425a8 100644 --- a/hw/rtl8139.c +++ b/hw/rtl8139.c @@ -3405,6 +3405,7 @@ static int pci_rtl8139_init(PCIDevice *dev) rtl8139_set_next_tctr_time(s, qemu_get_clock(vm_clock)); add_boot_device_path(s-conf.bootindex, dev-qdev, /ethernet-phy@0); +rtl8139_reset(dev-qdev); return 0; } -- 1.7.1
Re: [Qemu-devel] [PATCH 14/22] migration: Remove get_status() accessor
2011/2/23 Juan Quintela quint...@redhat.com: It is only used inside migration.c, and fields on that struct are accessed all around the place on that file. Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 16 +--- migration.h | 1 - 2 files changed, 5 insertions(+), 12 deletions(-) diff --git a/migration.c b/migration.c index dfe6a96..2b873fa 100644 --- a/migration.c +++ b/migration.c @@ -90,7 +90,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) int ret; if (current_migration - current_migration-get_status(current_migration) == MIG_STATE_ACTIVE) { + current_migration-state == MIG_STATE_ACTIVE) { monitor_printf(mon, migration already in progress\n); return -1; } @@ -135,7 +135,7 @@ int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) { MigrationState *s = current_migration; - if (s s-get_status(s) == MIG_STATE_ACTIVE) + if (s s-state == MIG_STATE_ACTIVE) s-cancel(s); return 0; @@ -234,7 +234,7 @@ void do_info_migrate(Monitor *mon, QObject **ret_data) if (current_migration) { MigrationState *s = current_migration; - switch (s-get_status(current_migration)) { + switch (s-state) { case MIG_STATE_NONE: /* no migration has happened ever */ break; @@ -375,7 +375,7 @@ static void migrate_fd_put_ready(void *opaque) } else { migrate_fd_completed(s); } - if (s-get_status(s) != MIG_STATE_COMPLETED) { + if (s-state != MIG_STATE_COMPLETED) { if (old_vm_running) { vm_start(); } @@ -383,11 +383,6 @@ static void migrate_fd_put_ready(void *opaque) } } -static int migrate_fd_get_status(MigrationState *s) -{ - return s-state; -} - static void migrate_fd_cancel(MigrationState *s) { if (s-state != MIG_STATE_ACTIVE) @@ -442,7 +437,7 @@ void remove_migration_state_change_notifier(Notifier *notify) int get_migration_state(void) { if (current_migration) { - return migrate_fd_get_status(current_migration); + return current_migration-state; } else { return MIG_STATE_ERROR; } @@ -477,7 +472,6 @@ static MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limi MigrationState *s = qemu_mallocz(sizeof(*s)); s-cancel = migrate_fd_cancel; - s-get_status = migrate_fd_get_status; s-blk = blk; s-shared = inc; s-mon = NULL; diff --git a/migration.h b/migration.h index 5455d8b..58a6e06 100644 --- a/migration.h +++ b/migration.h @@ -37,7 +37,6 @@ struct MigrationState int (*close)(MigrationState*); int (*write)(MigrationState*, const void *, size_t); void (*cancel)(MigrationState *s); - int (*get_status)(MigrationState *s); void *opaque; int blk; int shared; I agree to access s-state directly inside of migration.c, but I disagree to remove get_status() accessor right away. We don't have strong motivations for doing that AFAIK. Yoshi -- 1.7.4
Re: [Qemu-devel] [PATCH 2/2 V1] Fixed EPROM for AMD driver compatibility under DOS with Netware driver
On 22 February 2011 21:00, Gerhard Wiesinger li...@wiesinger.com wrote: Hello, No comments? Can someone commit? Not my area of the code, but some general remarks: (1) your patch email seems to be in an odd format with two patches concatenated, the first of which just changes a line introduced in the first. You should fix this and resubmit as a single patch in the right format that makes the changes you want. (2) Style issues: QEMU's coding style is for C-style comments (/* .. */) not C++-style //. Also lines should be 80 characters maximum, so your end-of-line comments should probably be moved. (3) Your commit message could be better. Convention is that the summary (subject) should start with an indication of what area is being patched, eg hw/pcnet.c: Fix EPROM contents to suit AMD netware drivers and then in the body of the commit message you have more detail. Remarks about what you're fixing and what you've tested go here, not in comments like: // bugfix under DOS with AMD netware driver: AMD PCNTNW Ethernet MLID v3.10 (960115), network card not found // works well under DOS with AMD NDIS driver v2.0.1, Knoppix 6.2 ok If you fix these cosmetic issues you'll have a patch which is easier to review and more likely to be applied. -- PMM
Re: [Qemu-devel] [PATCH 21/22] migration: Export a function that tells if the migration has finished correctly
2011/2/23 Juan Quintela quint...@redhat.com: This will allows us to hide the status values. Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 4 ++-- migration.h | 2 +- ui/spice-core.c | 4 +--- 3 files changed, 4 insertions(+), 6 deletions(-) diff --git a/migration.c b/migration.c index 312a029..383ebaf 100644 --- a/migration.c +++ b/migration.c @@ -335,9 +335,9 @@ void remove_migration_state_change_notifier(Notifier *notify) notifier_list_remove(migration_state_notifiers, notify); } -int get_migration_state(void) +bool migration_has_finished(void) { - return current_migration.state; + return current_migration.state == MIG_STATE_COMPLETED; } void migrate_fd_connect(MigrationState *s) diff --git a/migration.h b/migration.h index 6477b51..9457807 100644 --- a/migration.h +++ b/migration.h @@ -82,6 +82,6 @@ void migrate_fd_connect(MigrationState *s); void add_migration_state_change_notifier(Notifier *notify); void remove_migration_state_change_notifier(Notifier *notify); -int get_migration_state(void); +bool migration_has_finished(void); #endif diff --git a/ui/spice-core.c b/ui/spice-core.c index 1aa1a5e..997603d 100644 --- a/ui/spice-core.c +++ b/ui/spice-core.c @@ -422,9 +422,7 @@ void do_info_spice(Monitor *mon, QObject **ret_data) static void migration_state_notifier(Notifier *notifier) { - int state = get_migration_state(); - - if (state == MIG_STATE_COMPLETED) { + if (migration_has_finished()) { #if SPICE_SERVER_VERSION = 0x000701 /* 0.7.1 */ spice_server_migrate_switch(spice_server); #endif I agree to add migration_has_finished, but I don't see why we want to remove get_migration_state. Are we going to make migration_has_* for each state even migration gets complicated? Yoshi -- 1.7.4
Re: [Qemu-devel] [PATCH 19/22] migration: convert current_migration from pointer to struct
2011/2/23 Juan Quintela quint...@redhat.com: This cleans up a lot the code as we don't have to check anymore if the variable is NULL or not. Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 119 -- 1 files changed, 49 insertions(+), 70 deletions(-) diff --git a/migration.c b/migration.c index 4014330..7b1e679 100644 --- a/migration.c +++ b/migration.c @@ -34,7 +34,9 @@ /* Migration speed throttling */ static int64_t max_throttle = (32 20); -static MigrationState *current_migration; +static MigrationState current_migration = { + .state = MIG_STATE_NONE, +}; static NotifierList migration_state_notifiers = NOTIFIER_LIST_INITIALIZER(migration_state_notifiers); @@ -135,37 +137,34 @@ void do_info_migrate(Monitor *mon, QObject **ret_data) { QDict *qdict; - if (current_migration) { - - switch (current_migration-state) { - case MIG_STATE_NONE: - /* no migration has happened ever */ - break; - case MIG_STATE_ACTIVE: - qdict = qdict_new(); - qdict_put(qdict, status, qstring_from_str(active)); - - migrate_put_status(qdict, ram, ram_bytes_transferred(), - ram_bytes_remaining(), ram_bytes_total()); - - if (blk_mig_active()) { - migrate_put_status(qdict, disk, blk_mig_bytes_transferred(), - blk_mig_bytes_remaining(), - blk_mig_bytes_total()); - } - - *ret_data = QOBJECT(qdict); - break; - case MIG_STATE_COMPLETED: - *ret_data = qobject_from_jsonf({ 'status': 'completed' }); - break; - case MIG_STATE_ERROR: - *ret_data = qobject_from_jsonf({ 'status': 'failed' }); - break; - case MIG_STATE_CANCELLED: - *ret_data = qobject_from_jsonf({ 'status': 'cancelled' }); - break; + switch (current_migration.state) { + case MIG_STATE_NONE: + /* no migration has happened ever */ + break; + case MIG_STATE_ACTIVE: + qdict = qdict_new(); + qdict_put(qdict, status, qstring_from_str(active)); + + migrate_put_status(qdict, ram, ram_bytes_transferred(), + ram_bytes_remaining(), ram_bytes_total()); + + if (blk_mig_active()) { + migrate_put_status(qdict, disk, blk_mig_bytes_transferred(), + blk_mig_bytes_remaining(), + blk_mig_bytes_total()); } + + *ret_data = QOBJECT(qdict); + break; + case MIG_STATE_COMPLETED: + *ret_data = qobject_from_jsonf({ 'status': 'completed' }); + break; + case MIG_STATE_ERROR: + *ret_data = qobject_from_jsonf({ 'status': 'failed' }); + break; + case MIG_STATE_CANCELLED: + *ret_data = qobject_from_jsonf({ 'status': 'cancelled' }); + break; } } @@ -339,11 +338,7 @@ void remove_migration_state_change_notifier(Notifier *notify) int get_migration_state(void) { - if (current_migration) { - return current_migration-state; - } else { - return MIG_STATE_ERROR; - } + return current_migration.state; } void migrate_fd_connect(MigrationState *s) @@ -369,27 +364,22 @@ void migrate_fd_connect(MigrationState *s) migrate_fd_put_ready(s); } -static MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limit, - int detach, int blk, int inc) +static void migrate_init_state(Monitor *mon, int64_t bandwidth_limit, + int detach, int blk, int inc) { - MigrationState *s = qemu_mallocz(sizeof(*s)); - - s-blk = blk; - s-shared = inc; - s-mon = NULL; - s-bandwidth_limit = bandwidth_limit; - s-state = MIG_STATE_NONE; + current_migration.blk = blk; + current_migration.shared = inc; + current_migration.mon = NULL; + current_migration.bandwidth_limit = bandwidth_limit; + current_migration.state = MIG_STATE_NONE; if (!detach) { - migrate_fd_monitor_suspend(s, mon); + migrate_fd_monitor_suspend(current_migration, mon); } - - return s; } int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) { - MigrationState *s = NULL; const char *p; int detach = qdict_get_try_bool(qdict, detach, 0); int blk = qdict_get_try_bool(qdict, blk, 0); @@ -397,8 +387,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) const char *uri = qdict_get_str(qdict, uri); int ret; - if (current_migration - current_migration-state == MIG_STATE_ACTIVE) { + if (current_migration.state == MIG_STATE_ACTIVE) { monitor_printf(mon, migration already
Re: [Qemu-devel] [PATCH 18/22] migration: another case of global variable assigned to local one
2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/migration.c b/migration.c index accc6e4..4014330 100644 --- a/migration.c +++ b/migration.c @@ -136,9 +136,8 @@ void do_info_migrate(Monitor *mon, QObject **ret_data) QDict *qdict; if (current_migration) { - MigrationState *s = current_migration; - switch (s-state) { + switch (current_migration-state) { case MIG_STATE_NONE: /* no migration has happened ever */ break; Looks good to me. Yoshi -- 1.7.4
[Qemu-devel] Re: [PATCH 00/22] Refactor and cleaup migration code
On 02/23/2011 01:44 AM, Juan Quintela wrote: This series: - Fold MigrationState into FdMigrationState (and then rename) - Factorize migration statec creation in a single place - Make use of MIG_STATE_*, setup through helpers and make them local - remove relase cancel callbacks (where used only one in same file than defined) - get_status() is no more, just access directly to .state - current_migration use cleanup, and make variable static - max_throotle is gone, now inside current_migration - change get_migration_status() to migration_has_finished() and actualize single user. Please review. Some changes are a matter of taste, but overall looks very nice! Reviewed-by: Paolo Bonzini pbonz...@redhat.com Paolo
Re: [Qemu-devel] [PATCH 16/22] migration: Move exported functions to the end of the file
2011/2/23 Juan Quintela quint...@redhat.com: This means we can remove the two forward declarations. Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 188 +-- 1 files changed, 92 insertions(+), 96 deletions(-) diff --git a/migration.c b/migration.c index 92bff01..d7dfe1e 100644 --- a/migration.c +++ b/migration.c @@ -76,90 +76,6 @@ void process_incoming_migration(QEMUFile *f) vm_start(); } -static MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limit, - int detach, int blk, int inc); - -int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) -{ - MigrationState *s = NULL; - const char *p; - int detach = qdict_get_try_bool(qdict, detach, 0); - int blk = qdict_get_try_bool(qdict, blk, 0); - int inc = qdict_get_try_bool(qdict, inc, 0); - const char *uri = qdict_get_str(qdict, uri); - int ret; - - if (current_migration - current_migration-state == MIG_STATE_ACTIVE) { - monitor_printf(mon, migration already in progress\n); - return -1; - } - - if (qemu_savevm_state_blocked(mon)) { - return -1; - } - - s = migrate_create_state(mon, max_throttle, detach, blk, inc); - - if (strstart(uri, tcp:, p)) { - ret = tcp_start_outgoing_migration(s, p); -#if !defined(WIN32) - } else if (strstart(uri, exec:, p)) { - ret = exec_start_outgoing_migration(s, p); - } else if (strstart(uri, unix:, p)) { - ret = unix_start_outgoing_migration(s, p); - } else if (strstart(uri, fd:, p)) { - ret = fd_start_outgoing_migration(s, p); -#endif - } else { - monitor_printf(mon, unknown migration protocol: %s\n, uri); - ret = -EINVAL; - goto free_migrate_state; - } - - if (ret 0) { - monitor_printf(mon, migration failed\n); - goto free_migrate_state; - } - - qemu_free(current_migration); - current_migration = s; - notifier_list_notify(migration_state_notifiers); - return 0; -free_migrate_state: - qemu_free(s); - return -1; -} - -static void migrate_fd_cancel(MigrationState *s); - -int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) -{ - if (current_migration) - migrate_fd_cancel(current_migration); - - return 0; -} - -int do_migrate_set_speed(Monitor *mon, const QDict *qdict, QObject **ret_data) -{ - int64_t d; - MigrationState *s; - - d = qdict_get_int(qdict, value); - if (d 0) { - d = 0; - } - max_throttle = d; - - s = current_migration; - if (s s-file) { - qemu_file_set_rate_limit(s-file, max_throttle); - } - - return 0; -} - /* amount of nanoseconds we are willing to wait for migration to be down. * the choice of nanoseconds is because it is the maximum resolution that * get_clock() can achieve. It is an internal measure. All user-visible @@ -171,18 +87,6 @@ uint64_t migrate_max_downtime(void) return max_downtime; } -int do_migrate_set_downtime(Monitor *mon, const QDict *qdict, - QObject **ret_data) -{ - double d; - - d = qdict_get_double(qdict, value) * 1e9; - d = MAX(0, MIN(UINT64_MAX, d)); - max_downtime = (uint64_t)d; - - return 0; -} - static void migrate_print_status(Monitor *mon, const char *name, const QDict *status_dict) { @@ -483,3 +387,95 @@ static MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limi return s; } + +int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) +{ + MigrationState *s = NULL; + const char *p; + int detach = qdict_get_try_bool(qdict, detach, 0); + int blk = qdict_get_try_bool(qdict, blk, 0); + int inc = qdict_get_try_bool(qdict, inc, 0); + const char *uri = qdict_get_str(qdict, uri); + int ret; + + if (current_migration + current_migration-state == MIG_STATE_ACTIVE) { + monitor_printf(mon, migration already in progress\n); + return -1; + } + + if (qemu_savevm_state_blocked(mon)) { + return -1; + } + + s = migrate_create_state(mon, max_throttle, detach, blk, inc); + + if (strstart(uri, tcp:, p)) { + ret = tcp_start_outgoing_migration(s, p); +#if !defined(WIN32) + } else if (strstart(uri, exec:, p)) { + ret = exec_start_outgoing_migration(s, p); + } else if (strstart(uri, unix:, p)) { + ret = unix_start_outgoing_migration(s, p); + } else if (strstart(uri, fd:, p)) { + ret = fd_start_outgoing_migration(s, p); +#endif + } else { + monitor_printf(mon, unknown migration protocol: %s\n, uri); + ret = -EINVAL; + goto free_migrate_state; + } + + if (ret 0) { +
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. I don't think what you're proposing is a satisfactory solution. In my book, a single data structure that can provide all of the features is better than a bunch of independent hacks that allows only half of it. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? That you only access it in the slow path (discard) isn't true, because you certainly want to reallocate freed clusters. Otherwise you could just leak them without maintaining a list of leaked clusters... The only use of a refcount of more than 1-bit is internal snapshots AFAICT. Of the currently implemented features, internal snapshots and compression. Kevin
[Qemu-devel] Re: [PATCH 03/22] migration: Fold MigrationState into FdMigrationState
Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2011/2/23 Juan Quintela quint...@redhat.com: struct FdMigrationState { - MigrationState mig_state; int64_t bandwidth_limit; QEMUFile *file; int fd; @@ -48,7 +35,12 @@ struct FdMigrationState int (*get_error)(struct FdMigrationState*); int (*close)(struct FdMigrationState*); int (*write)(struct FdMigrationState*, const void *, size_t); + void (*cancel)(FdMigrationState *s); + int (*get_status)(FdMigrationState *s); + void (*release)(FdMigrationState *s); void *opaque; + int blk; + int shared; }; Although I don't have objections for folding MigrationState into FdMigrationState, it would be good to ask why the original author split it intentionally. I asked, Anthony answer was that it is a historical artifact. Later, Juan.
[Qemu-devel] Re: [PATCH 08/22] migration: Check that migration is active before cancel it
Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/migration.c b/migration.c index 397a0b9..55f58c8 100644 --- a/migration.c +++ b/migration.c @@ -138,7 +138,7 @@ int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) { MigrationState *s = current_migration; - if (s) + if (s s-get_status(s) == MIG_STATE_ACTIVE) s-cancel(s); return 0; Why don't you remove *s again? Removed in a next patch. Later, Juan.
Re: [Qemu-devel] [PATCH 02/22] migration: Use FdMigrationState instead of MigrationState when possible
2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 31 ++- migration.h | 16 2 files changed, 22 insertions(+), 25 deletions(-) diff --git a/migration.c b/migration.c index f9aaadc..3a371a3 100644 --- a/migration.c +++ b/migration.c @@ -34,7 +34,7 @@ /* Migration speed throttling */ static int64_t max_throttle = (32 20); -static MigrationState *current_migration; +static FdMigrationState *current_migration; static NotifierList migration_state_notifiers = NOTIFIER_LIST_INITIALIZER(migration_state_notifiers); @@ -86,7 +86,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) const char *uri = qdict_get_str(qdict, uri); if (current_migration - current_migration-get_status(current_migration) == MIG_STATE_ACTIVE) { + current_migration-mig_state.get_status(current_migration) == MIG_STATE_ACTIVE) { monitor_printf(mon, migration already in progress\n); return -1; } @@ -120,20 +120,20 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) } if (current_migration) { - current_migration-release(current_migration); + current_migration-mig_state.release(current_migration); } - current_migration = s-mig_state; + current_migration = s; notifier_list_notify(migration_state_notifiers); return 0; } int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data) { - MigrationState *s = current_migration; + FdMigrationState *s = current_migration; if (s) - s-cancel(s); + s-mig_state.cancel(s); return 0; } @@ -149,7 +149,7 @@ int do_migrate_set_speed(Monitor *mon, const QDict *qdict, QObject **ret_data) } max_throttle = d; - s = migrate_to_fms(current_migration); + s = current_migration; if (s s-file) { qemu_file_set_rate_limit(s-file, max_throttle); } @@ -227,10 +227,11 @@ static void migrate_put_status(QDict *qdict, const char *name, void do_info_migrate(Monitor *mon, QObject **ret_data) { QDict *qdict; - MigrationState *s = current_migration; - if (s) { - switch (s-get_status(s)) { + if (current_migration) { + MigrationState *s = current_migration-mig_state; + + switch (s-get_status(current_migration)) { case MIG_STATE_ACTIVE: qdict = qdict_new(); qdict_put(qdict, status, qstring_from_str(active)); @@ -399,16 +400,13 @@ void migrate_fd_put_ready(void *opaque) } } -int migrate_fd_get_status(MigrationState *mig_state) +int migrate_fd_get_status(FdMigrationState *s) { - FdMigrationState *s = migrate_to_fms(mig_state); return s-state; } -void migrate_fd_cancel(MigrationState *mig_state) +void migrate_fd_cancel(FdMigrationState *s) { - FdMigrationState *s = migrate_to_fms(mig_state); - if (s-state != MIG_STATE_ACTIVE) return; @@ -421,9 +419,8 @@ void migrate_fd_cancel(MigrationState *mig_state) migrate_fd_cleanup(s); } -void migrate_fd_release(MigrationState *mig_state) +void migrate_fd_release(FdMigrationState *s) { - FdMigrationState *s = migrate_to_fms(mig_state); DPRINTF(releasing state\n); diff --git a/migration.h b/migration.h index db0e45a..f49a9e2 100644 --- a/migration.h +++ b/migration.h @@ -25,18 +25,18 @@ typedef struct MigrationState MigrationState; +typedef struct FdMigrationState FdMigrationState; + struct MigrationState { /* FIXME: add more accessors to print migration info */ - void (*cancel)(MigrationState *s); - int (*get_status)(MigrationState *s); - void (*release)(MigrationState *s); + void (*cancel)(FdMigrationState *s); + int (*get_status)(FdMigrationState *s); + void (*release)(FdMigrationState *s); int blk; int shared; }; -typedef struct FdMigrationState FdMigrationState; - struct FdMigrationState { MigrationState mig_state; @@ -120,11 +120,11 @@ void migrate_fd_connect(FdMigrationState *s); void migrate_fd_put_ready(void *opaque); -int migrate_fd_get_status(MigrationState *mig_state); +int migrate_fd_get_status(FdMigrationState *mig_state); -void migrate_fd_cancel(MigrationState *mig_state); +void migrate_fd_cancel(FdMigrationState *mig_state); -void migrate_fd_release(MigrationState *mig_state); +void migrate_fd_release(FdMigrationState *mig_state); void migrate_fd_wait_for_unfreeze(void *opaque); Looks good to me. Yoshi -- 1.7.4
[Qemu-devel] Re: [PATCH 14/22] migration: Remove get_status() accessor
Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2011/2/23 Juan Quintela quint...@redhat.com: It is only used inside migration.c, and fields on that struct are accessed all around the place on that file. I agree to access s-state directly inside of migration.c, but I disagree to remove get_status() accessor right away. We don't have strong motivations for doing that AFAIK. Only user outside of migration.c was ui/spice-core.c, and it just wanted to know if migration has finished at all. At this point I was trying to isolate what other parts of MigrationState are used externally. That way, it gets easier to change that later. At this point, only things used outside of migration.c are: - write, clase, get_error: trivial to fix, just add setters for them. - fd: that is not enterely trivial to fix. Later, Juan.
[Qemu-devel] Re: [PATCH 22/22] migration: Make state definitions local
Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 6 ++ migration.h | 6 -- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/migration.c b/migration.c index 383ebaf..90fc2a0 100644 --- a/migration.c +++ b/migration.c @@ -31,6 +31,12 @@ do { } while (0) #endif +#define MIG_STATE_ERROR -1 +#define MIG_STATE_NONE 0 +#define MIG_STATE_CANCELLED 1 +#define MIG_STATE_ACTIVE 2 +#define MIG_STATE_COMPLETED 3 + static MigrationState current_migration = { .state = MIG_STATE_NONE, /* Migration speed throttling */ diff --git a/migration.h b/migration.h index 9457807..493fbe5 100644 --- a/migration.h +++ b/migration.h @@ -18,12 +18,6 @@ #include qemu-common.h #include notify.h -#define MIG_STATE_ERROR -1 -#define MIG_STATE_NONE 0 -#define MIG_STATE_CANCELLED 1 -#define MIG_STATE_ACTIVE 2 -#define MIG_STATE_COMPLETED 3 - Although you're right, I would prefer to keep it so that somebody outside of migration may understand the status in the future if there are no harms. my plan is to move MigrationState inside migration.c, and then decide what to export/not export. Next thing to do is move migration to its own thread. Before doing that, I need to know what parts are used/not used outside migration.c. Removing it now means that nothing gets to use it without needing a patch. Later, Juan..
Re: [Qemu-devel] Building QEMU on PS3
Hi, Roy Seems to be related: http://www.mail-archive.com/qemu-devel@nongnu.org/msg55914.html Thanks for the info. I wonder if it is possible to cross compile qemu for ppc (host) on a x86 machine. Regards, chenwj -- Wei-Ren Chen (陳韋任) Parallel Processing Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667
Re: [Qemu-devel] [PATCH 06/22] migration: Make all posible migration functions static
2011/2/23 Juan Quintela quint...@redhat.com: I have to move two functions postions to avoid forward declarations Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 72 +- migration.h | 12 - 2 files changed, 36 insertions(+), 48 deletions(-) diff --git a/migration.c b/migration.c index e773806..1853380 100644 --- a/migration.c +++ b/migration.c @@ -273,15 +273,7 @@ static void migrate_fd_monitor_suspend(MigrationState *s, Monitor *mon) } } -void migrate_fd_error(MigrationState *s) -{ - DPRINTF(setting error state\n); - s-state = MIG_STATE_ERROR; - notifier_list_notify(migration_state_notifiers); - migrate_fd_cleanup(s); -} - -int migrate_fd_cleanup(MigrationState *s) +static int migrate_fd_cleanup(MigrationState *s) { int ret = 0; @@ -308,7 +300,15 @@ int migrate_fd_cleanup(MigrationState *s) return ret; } -void migrate_fd_put_notify(void *opaque) +void migrate_fd_error(MigrationState *s) +{ + DPRINTF(setting error state\n); + s-state = MIG_STATE_ERROR; + notifier_list_notify(migration_state_notifiers); + migrate_fd_cleanup(s); +} + +static void migrate_fd_put_notify(void *opaque) { MigrationState *s = opaque; @@ -316,7 +316,7 @@ void migrate_fd_put_notify(void *opaque) qemu_file_put_notify(s-file); } -ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) +static ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) { MigrationState *s = opaque; ssize_t ret; @@ -341,29 +341,7 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) return ret; } -void migrate_fd_connect(MigrationState *s) -{ - int ret; - - s-file = qemu_fopen_ops_buffered(s, - s-bandwidth_limit, - migrate_fd_put_buffer, - migrate_fd_put_ready, - migrate_fd_wait_for_unfreeze, - migrate_fd_close); - - DPRINTF(beginning savevm\n); - ret = qemu_savevm_state_begin(s-mon, s-file, s-blk, s-shared); - if (ret 0) { - DPRINTF(failed, %d\n, ret); - migrate_fd_error(s); - return; - } - - migrate_fd_put_ready(s); -} - -void migrate_fd_put_ready(void *opaque) +static void migrate_fd_put_ready(void *opaque) { MigrationState *s = opaque; @@ -431,7 +409,7 @@ static void migrate_fd_release(MigrationState *s) qemu_free(s); } -void migrate_fd_wait_for_unfreeze(void *opaque) +static void migrate_fd_wait_for_unfreeze(void *opaque) { MigrationState *s = opaque; int ret; @@ -450,7 +428,7 @@ void migrate_fd_wait_for_unfreeze(void *opaque) } while (ret == -1 (s-get_error(s)) == EINTR); } -int migrate_fd_close(void *opaque) +static int migrate_fd_close(void *opaque) { MigrationState *s = opaque; @@ -477,6 +455,28 @@ int get_migration_state(void) } } +void migrate_fd_connect(MigrationState *s) +{ + int ret; + + s-file = qemu_fopen_ops_buffered(s, + s-bandwidth_limit, + migrate_fd_put_buffer, + migrate_fd_put_ready, + migrate_fd_wait_for_unfreeze, + migrate_fd_close); + + DPRINTF(beginning savevm\n); + ret = qemu_savevm_state_begin(s-mon, s-file, s-blk, s-shared); + if (ret 0) { + DPRINTF(failed, %d\n, ret); + migrate_fd_error(s); + return; + } + + migrate_fd_put_ready(s); +} + MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limit, int detach, int blk, int inc) { diff --git a/migration.h b/migration.h index 0178414..048ee46 100644 --- a/migration.h +++ b/migration.h @@ -100,20 +100,8 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon, void migrate_fd_error(MigrationState *s); -int migrate_fd_cleanup(MigrationState *s); - -void migrate_fd_put_notify(void *opaque); - -ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size); - void migrate_fd_connect(MigrationState *s); -void migrate_fd_put_ready(void *opaque); - -void migrate_fd_wait_for_unfreeze(void *opaque); - -int migrate_fd_close(void *opaque); - MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limit, int detach, int blk, int inc); Looks good to me. Yoshi -- 1.7.4
Re: [Qemu-devel] [PATCH] `qdev_free` when unplug a pci device
Hi Markus, On Wed, Feb 23, 2011 at 9:30 AM, Markus Armbruster arm...@redhat.com wrote: I don't think this patch is correct. Let me explain. Device hot unplug is *not* guaranteed to succeed. For some buses, such as USB, it always succeeds immediately, i.e. when the device_del monitor command finishes, the device is gone. Live is good. But for PCI, device_del merely initiates the ACPI unplug rain dance. It doesn't wait for the dance to complete. Why? The dance can take an unpredictable amount of time, including forever. Problem: Subsequent device_add can fail if it reuses the qdev ID or PCI slot, and the unplug has not yet completed (race condition), or it failed. Yes, Virginia, PCI hotplug *can* fail. When unplug succeeds, the qdev is automatically destroyed. pciej_write() does that for PIIX4. Looks like pcie_cap_slot_event() does it for PCIE. Your patch adds a *second* qdev_free(). No good. Thanks for the complete explanation. I now understand my mistake and find out why it wasn't working as expected. On a Linux virtual machine I forgot to load the `acpiphp` module which makes the pci device disappear when doing a detach. I thought that, even without this module the detach should have freed the corresponding device on qemu side, regardless if the virtual machine was supporting or not acpi signals. Please ignore my patch. Regards, -- William
Re: [Qemu-devel] [PATCH 09/22] migration: Introduce MIG_STATE_NONE
2011/2/23 Juan Quintela quint...@redhat.com: Use MIG_STATE_ACTIVE only when migration has really started Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 6 +- migration.h | 3 ++- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/migration.c b/migration.c index 55f58c8..f015e02 100644 --- a/migration.c +++ b/migration.c @@ -238,6 +238,9 @@ void do_info_migrate(Monitor *mon, QObject **ret_data) MigrationState *s = current_migration; switch (s-get_status(current_migration)) { + case MIG_STATE_NONE: + /* no migration has happened ever */ + break; case MIG_STATE_ACTIVE: qdict = qdict_new(); qdict_put(qdict, status, qstring_from_str(active)); @@ -465,6 +468,7 @@ void migrate_fd_connect(MigrationState *s) { int ret; + s-state = MIG_STATE_ACTIVE; s-file = qemu_fopen_ops_buffered(s, s-bandwidth_limit, migrate_fd_put_buffer, @@ -495,7 +499,7 @@ static MigrationState *migrate_create_state(Monitor *mon, int64_t bandwidth_limi s-shared = inc; s-mon = NULL; s-bandwidth_limit = bandwidth_limit; - s-state = MIG_STATE_ACTIVE; + s-state = MIG_STATE_NONE; if (!detach) { migrate_fd_monitor_suspend(s, mon); diff --git a/migration.h b/migration.h index 7d28dd3..3df2293 100644 --- a/migration.h +++ b/migration.h @@ -19,9 +19,10 @@ #include notify.h #define MIG_STATE_ERROR -1 -#define MIG_STATE_COMPLETED 0 +#define MIG_STATE_NONE 0 #define MIG_STATE_CANCELLED 1 #define MIG_STATE_ACTIVE 2 +#define MIG_STATE_COMPLETED 3 It may be a good chance to make them enum? Yoshi typedef struct MigrationState MigrationState; -- 1.7.4
[Qemu-devel] Re: [PATCH 0/2] Fix error handling in migration when the peer is killed.
Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: Hi, During live migration, if the receiver side of qemu gets killed, the sender side seems to be handling the error incorrectly, like it passes the iterate phase (stage 2) and moves on to the complete state (stage 3). These patches fix the issue. Agreed. Integrated into my series of cleanups. Thanks.
Re: [Qemu-devel] [PATCH] Remove a detached device from qemu_device_opts.
Hi Minoru, On Tue, Feb 15, 2011 at 3:32 AM, Minoru Usui u...@mxm.nes.nec.co.jp wrote: I can reproduce, too. But strangely, it don't occur in case of loading acpiphp driver to the guest VM on below environment. Host : RHEL6.0 Guest: RHEL5.5 Unfortunately, I'm not familiar with qemu-kvm. I investigated below questions about this problem, but I couldn't resolve them. - How to call qdev_free() asynchronously. (How should we fix this problem) - Why it don't occur with acpiphp driver If anyone knows answer of above questions or its clue, please let me know. If fact this is not a bug. `qdev_free` is called when the acpi detach succeed in `pciej_write`. The virtual machine has to correctly support acpi signals. Please read the explanation from Markus Armbruster on http://lists.nongnu.org/archive/html/qemu-devel/2011-02/msg02637.html Regards, -- William
Re: [Qemu-devel] Re: QEMU regression problems - Update FPU
On Wed, Feb 23, 2011 at 9:16 AM, Peter Maydell peter.mayd...@linaro.org wrote: On 18 February 2011 07:12, Gerhard Wiesinger li...@wiesinger.com wrote: Issue 1.) with FPU still present I tracked down the problematic code and it is a rounding error from double precision to 64bit floats: Any ideas how to fix such an issue in general? QEMU result in ST0: 0.42925860786976457 (wrong emulated) KVM result in ST0: 0.42925860786975449 (correct) This is an error when running QEMU in TCG mode, right? At the moment x86 is the odd-one-out in that it doesn't use CONFIG_SOFTFLOAT for its FPU emulation, so somebody has made an explicit choice of preferring speed over accuracy, and I am unsurprised that there are rounding errors as a result. Even if you were using SoftFloat, you'd probably still get wrong results given that this test seems to test trigonometric instructions which aren't implemented in SoftFloat, but are relying on libm. Laurent
Re: [Qemu-devel] [PATCH 10/22] migration: Refactor and simplify error checking in migrate_fd_put_ready
2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 20 +--- 1 files changed, 9 insertions(+), 11 deletions(-) diff --git a/migration.c b/migration.c index f015e02..641df9f 100644 --- a/migration.c +++ b/migration.c @@ -361,28 +361,26 @@ static void migrate_fd_put_ready(void *opaque) DPRINTF(iterate\n); if (qemu_savevm_state_iterate(s-mon, s-file) == 1) { - int state; int old_vm_running = vm_running; DPRINTF(done iterating\n); vm_stop(VMSTOP_MIGRATE); - if ((qemu_savevm_state_complete(s-mon, s-file)) 0) { - if (old_vm_running) { - vm_start(); - } - state = MIG_STATE_ERROR; + if (qemu_savevm_state_complete(s-mon, s-file) 0) { + migrate_fd_error(s); } else { - state = MIG_STATE_COMPLETED; + if (migrate_fd_cleanup(s) 0) { + migrate_fd_error(s); + } else { + s-state = MIG_STATE_COMPLETED; + notifier_list_notify(migration_state_notifiers); + } } - if (migrate_fd_cleanup(s) 0) { + if (s-get_status(s) != MIG_STATE_COMPLETED) { if (old_vm_running) { vm_start(); } This part, although it's not fair to ask you, but calling vm_start when != MIG_STATE_COMPLETED terrifies me because just failing migrate_fd_cleanup (mostly calling qemu_fclose) may cause split brain between src/dst. Although I haven't encountered this situation, just having stopped VMs on both sides is safer. Thanks, Yoshi - state = MIG_STATE_ERROR; } - s-state = state; - notifier_list_notify(migration_state_notifiers); } } -- 1.7.4
Re: [Qemu-devel] Building QEMU on PS3
Hi chenwj, 2011/2/23 陳韋任 che...@iis.sinica.edu.tw: Hi, Roy Seems to be related: http://www.mail-archive.com/qemu-devel@nongnu.org/msg55914.html Thanks for the info. I wonder if it is possible to cross compile qemu for ppc (host) on a x86 machine. If you have cross compile toolchain and cross compiled necessary libraries, then it should work. You can also try adding extra swap on PS3 to see if it can compile the sources. http://psubuntu.com/wiki/PSUbuntuGPU http://stason.org/TULARC/os/linux-faq/038-How-Do-I-Add-Temporary-Swap-Space.html Best regards, Roy
Re: [Qemu-devel] Re: [PATCH 0/2] Fix error handling in migration when the peer is killed.
2011/2/23 Juan Quintela quint...@redhat.com: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: Hi, During live migration, if the receiver side of qemu gets killed, the sender side seems to be handling the error incorrectly, like it passes the iterate phase (stage 2) and moves on to the complete state (stage 3). These patches fix the issue. Agreed. Integrated into my series of cleanups. Thanks. Thanks! Yoshi
Re: [Qemu-devel] General IO ports in pc386
Tomas Bures bu...@d3s.mff.cuni.cz writes: Thank you. This is exactly what I was looking for. I've created the device, added the compilation of it to Makefile.target . Now, I'm wondering how to persuade QEMU to initialize it? I've registered it using the code below. The registration executes during QEMU launch, the the initialization doesn't. Please where should I add it? static void ers_io_register_devices(void) { isa_qdev_register(ers_io_info); ers_io_debug(\n); } device_init(ers_io_register_devices); Try starting with -device NAME, where NAME is whatever you put into qdev.name. Your device should then be visible in info qtree.
[Qemu-devel] Re: [PATCH 10/22] migration: Refactor and simplify error checking in migrate_fd_put_ready
Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 20 +--- 1 files changed, 9 insertions(+), 11 deletions(-) diff --git a/migration.c b/migration.c index f015e02..641df9f 100644 --- a/migration.c +++ b/migration.c @@ -361,28 +361,26 @@ static void migrate_fd_put_ready(void *opaque) DPRINTF(iterate\n); if (qemu_savevm_state_iterate(s-mon, s-file) == 1) { - int state; int old_vm_running = vm_running; DPRINTF(done iterating\n); vm_stop(VMSTOP_MIGRATE); - if ((qemu_savevm_state_complete(s-mon, s-file)) 0) { - if (old_vm_running) { - vm_start(); - } - state = MIG_STATE_ERROR; + if (qemu_savevm_state_complete(s-mon, s-file) 0) { + migrate_fd_error(s); } else { - state = MIG_STATE_COMPLETED; + if (migrate_fd_cleanup(s) 0) { + migrate_fd_error(s); + } else { + s-state = MIG_STATE_COMPLETED; + notifier_list_notify(migration_state_notifiers); + } } - if (migrate_fd_cleanup(s) 0) { + if (s-get_status(s) != MIG_STATE_COMPLETED) { if (old_vm_running) { vm_start(); } This part, although it's not fair to ask you, but calling vm_start when != MIG_STATE_COMPLETED terrifies me because just failing migrate_fd_cleanup (mostly calling qemu_fclose) may cause split brain between src/dst. Although I haven't encountered this situation, just having stopped VMs on both sides is safer. I see your pain. I am not happy at all, but this was integrated by Anthony to fix this bug: commit 41ef56e61153d7bd27d34a634633bb51b1c5988d Author: Anthony Liguori aligu...@us.ibm.com Date: Wed Jun 2 14:55:25 2010 -0500 migration: respect exit status with exec: This fixes https://bugs.launchpad.net/qemu/+bug/391879 I think that it fixes that bug, but it makes me un-easy to restart vm if there is a failure in migrate_fd_cleanup(). As I didn't wanted to change behaviour with this series, I left it as it was. Next on ToDo list is to do something sensible with errors, just now we are not very good at handling them. Later, Juan.
[Qemu-devel] Re: [PATCH 00/22] Refactor and cleaup migration code
On 2011-02-23 01:44, Juan Quintela wrote: This series: - Fold MigrationState into FdMigrationState (and then rename) - Factorize migration statec creation in a single place - Make use of MIG_STATE_*, setup through helpers and make them local - remove relase cancel callbacks (where used only one in same file than defined) - get_status() is no more, just access directly to .state - current_migration use cleanup, and make variable static - max_throotle is gone, now inside current_migration - change get_migration_status() to migration_has_finished() and actualize single user. Please review. Tried checkpatch.pl? :) I can only recommend to include it in personal preparation scripts for patch series. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On Mon, Feb 21, 2011 at 09:51:22AM +0100, Paolo Bonzini wrote: This series redoes the way time spent waiting for I/O is accounted to the vm_clock. The current code is advancing qemu_icount before waiting for I/O. Instead, after the patch qemu_icount is left aside (it is a pure instruction counter) and qemu_icount_bias is changed according to the actual amount of time spent in the wait. This is more accurate, and actually works in the iothread case as well. (I started this as an experiment while trying to understand what was going on. But it fixes the bug and does not break the no-iothread case, so hey...). Patch 1 is a cleanup to Edgar's commit 225d02c (Avoid deadlock whith iothread and icount, 2011-01-23). Thanks, this one was a good cleanup, I've applied it. Patch 2 fixes another misunderstanding in the role of qemu_next_deadline. Patches 3 and 4 implement the actual new accounting algorithm. Sorry, I don't know the code well enough to give any sensible feedback on patch 2 - 4. I did test them with some of my guests and things seem to be OK with them but quite a bit slower. I saw around 10 - 20% slowdown with a cris guest and -icount 10. The slow down might be related to the issue with super slow icount together with iothread (adressed by Marcelos iothread timeout patch). With these patches, iothread -icount N doesn't work when the actual execution speed cannot keep up with the requested speed; the execution in that case is not deterministic. It works when the requested speed is slow enough. Sorry, would you mind explaning this a bit? For example, if I have a machine and guest sw that does no IO. It runs the CPU and only uses devices that use the virtual time (e.g timers and peripherals that compute stuff). Can I expect the guest (with fixed icount speed -icount N) to run deterministically regardless of host speed? Cheers
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On 02/23/2011 11:18 AM, Edgar E. Iglesias wrote: Sorry, I don't know the code well enough to give any sensible feedback on patch 2 - 4. I did test them with some of my guests and things seem to be OK with them but quite a bit slower. I saw around 10 - 20% slowdown with a cris guest and -icount 10. The slow down might be related to the issue with super slow icount together with iothread (adressed by Marcelos iothread timeout patch). No, this supersedes Marcelo's patch. 10-20% doesn't seem comparable to looks like it deadlocked anyway. Also, Jan has ideas on how to remove the synchronization overhead in the main loop for TCG+iothread. Have you tested without iothread too, both to check the speed and to ensure this introduces no regressions? With these patches, iothread -icount N doesn't work when the actual execution speed cannot keep up with the requested speed; the execution in that case is not deterministic. It works when the requested speed is slow enough. Sorry, would you mind explaning this a bit? -icount 0 doesn't give 1000 BogoMIPS unless the machine is fast enough to sustain it. But that's a bug. These patches are meant to be a start. For example, if I have a machine and guest sw that does no IO. It runs the CPU and only uses devices that use the virtual time (e.g timers and peripherals that compute stuff). Can I expect the guest (with fixed icount speed -icount N) to run deterministically regardless of host speed? Right now, only if the N is large enough for the host machine to sustain that speed. Paolo
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On Wed, Feb 23, 2011 at 11:25:54AM +0100, Paolo Bonzini wrote: On 02/23/2011 11:18 AM, Edgar E. Iglesias wrote: Sorry, I don't know the code well enough to give any sensible feedback on patch 2 - 4. I did test them with some of my guests and things seem to be OK with them but quite a bit slower. I saw around 10 - 20% slowdown with a cris guest and -icount 10. The slow down might be related to the issue with super slow icount together with iothread (adressed by Marcelos iothread timeout patch). No, this supersedes Marcelo's patch. 10-20% doesn't seem comparable to looks like it deadlocked anyway. Also, Jan has ideas on how to remove the synchronization overhead in the main loop for TCG+iothread. I see. I tried booting two of my MIPS and CRIS linux guests with iothread and -icount 4. Without your patch, the boot crawls super slow. Your patch gives a huge improvement. This was the deadlock scenario which I mentioned in previous emails. Just to clarify the previous test where I saw slowdown with your patch: A CRIS setup that has a CRIS and basically only two peripherals, a timer block and a device (X) that computes stuff but delays the results with a virtual timer. The guest CPU is 99% of the time just busy-waiting for device X to get ready. This latter test runs in 3.7s with icount 4 and without iothread, with or without your patch. With icount 4 and iothread it runs in ~1m5s without your patch and ~1m20s with your patch. That was the 20% slowdown I mentioned earlier. Don't know if that info helps... Have you tested without iothread too, both to check the speed and to ensure this introduces no regressions? I tried now, I see no regression without iothread. With these patches, iothread -icount N doesn't work when the actual execution speed cannot keep up with the requested speed; the execution in that case is not deterministic. It works when the requested speed is slow enough. Sorry, would you mind explaning this a bit? -icount 0 doesn't give 1000 BogoMIPS unless the machine is fast enough to sustain it. But that's a bug. These patches are meant to be a start. For example, if I have a machine and guest sw that does no IO. It runs the CPU and only uses devices that use the virtual time (e.g timers and peripherals that compute stuff). Can I expect the guest (with fixed icount speed -icount N) to run deterministically regardless of host speed? Right now, only if the N is large enough for the host machine to sustain that speed. OK thanks. Cheers
[Qemu-devel] [PATCH 2/7] introduce libcacard/vscard_common.h
--- Signed-off-by: Alon Levy al...@redhat.com v19-v20 changes: * checkpatch.pl v15-v16 changes: Protocol change: * VSCMsgInit capabilities and magic * removed ReaderResponse, will use Error instead with code==VSC_SUCCESS. * adaded Flush and FlushComplete, remove Reconnect. * define VSCARD_MAGIC * added error code VSC_SUCCESS. Fixes: * update VSCMsgInit comment * fix message type enum * remove underscore from wrapping define * update copyright * updated comments. * Header comment updated * remove C++ style comment * fix comment for VSCMsgError * give names to enums in typedefs --- libcacard/vscard_common.h | 167 + 1 files changed, 167 insertions(+), 0 deletions(-) create mode 100644 libcacard/vscard_common.h diff --git a/libcacard/vscard_common.h b/libcacard/vscard_common.h new file mode 100644 index 000..7449314 --- /dev/null +++ b/libcacard/vscard_common.h @@ -0,0 +1,167 @@ +/* Virtual Smart Card protocol definition + * + * This protocol is between a host using virtual smart card readers, + * and a client providing the smart cards, perhaps by emulating them or by + * access to real cards. + * + * Definitions for this protocol: + * Host - user of the card + * Client - owner of the card + * + * The current implementation passes the raw APDU's from 7816 and additionally + * contains messages to setup and teardown readers, handle insertion and + * removal of cards, negotiate the protocol via capabilities and provide + * for error responses. + * + * Copyright (c) 2011 Red Hat. + * + * This code is licensed under the LGPL. + */ + +#ifndef VSCARD_COMMON_H +#define VSCARD_COMMON_H + +#include stdint.h + +#define VERSION_MAJOR_BITS 11 +#define VERSION_MIDDLE_BITS 11 +#define VERSION_MINOR_BITS 10 + +#define MAKE_VERSION(major, middle, minor) \ + ((major (VERSION_MINOR_BITS + VERSION_MIDDLE_BITS)) \ + | (middle VERSION_MINOR_BITS) \ + | (minor)) + +/** IMPORTANT NOTE on VERSION + * + * The version below MUST be changed whenever a change in this file is made. + * + * The last digit, the minor, is for bug fix changes only. + * + * The middle digit is for backward / forward compatible changes, updates + * to the existing messages, addition of fields. + * + * The major digit is for a breaking change of protocol, presumably + * something that cannot be accomodated with the existing protocol. + */ + +#define VSCARD_VERSION MAKE_VERSION(0, 0, 2) + +typedef enum VSCMsgType { +VSC_Init = 1, +VSC_Error, +VSC_ReaderAdd, +VSC_ReaderRemove, +VSC_ATR, +VSC_CardRemove, +VSC_APDU, +VSC_Flush, +VSC_FlushComplete +} VSCMsgType; + +typedef enum VSCErrorCode { +VSC_SUCCESS = 0, +VSC_GENERAL_ERROR = 1, +VSC_CANNOT_ADD_MORE_READERS, +VSC_CARD_ALREAY_INSERTED, +} VSCErrorCode; + +#define VSCARD_UNDEFINED_READER_ID 0x +#define VSCARD_MINIMAL_READER_ID0 + +#define VSCARD_MAGIC (*(uint32_t *)VSCD) + +/* Header + * Each message starts with the header. + * type - message type + * reader_id - used by messages that are reader specific + * length - length of payload (not including header, i.e. zero for + * messages containing empty payloads) + */ +typedef struct VSCMsgHeader { +uint32_t type; +uint32_t reader_id; +uint32_t length; +uint8_tdata[0]; +} VSCMsgHeader; + +/* VSCMsgInit Client - Host + * Client sends it on connection, with its own capabilities. + * Host replies with VSCMsgInit filling in its capabilities. + * + * It is not meant to be used for negotiation, i.e. sending more then + * once from any side, but could be used for that in the future. + * */ +typedef struct VSCMsgInit { +uint32_t magic; +uint32_t version; +uint32_t capabilities[1]; /* receiver must check length, + array may grow in the future*/ +} VSCMsgInit; + +/* VSCMsgError Client - Host + * This message is a response to any of: + * Reader Add + * Reader Remove + * Card Remove + * If the operation was successful then VSC_SUCCESS + * is returned, other wise a specific error code. + * */ +typedef struct VSCMsgError { +uint32_t code; +} VSCMsgError; + +/* VSCMsgReaderAdd Client - Host + * Host replies with allocated reader id in VSCMsgError with code==SUCCESS. + * + * name - name of the reader on client side, UTF-8 encoded. Only used + * for client presentation (may be translated to the device presented to the + * guest), protocol wise only reader_id is important. + * */ +typedef struct VSCMsgReaderAdd { +uint8_tname[0]; +} VSCMsgReaderAdd; + +/* VSCMsgReaderRemove Client - Host + * The client's reader has been removed. + * */ +typedef struct VSCMsgReaderRemove { +} VSCMsgReaderRemove; + +/* VSCMsgATRClient - Host + * Answer to reset. Sent for card insertion or card reset. The reset/insertion + * happens on the client side, they do not require any action from the
[Qemu-devel] [PATCH v20 0/7] usb-ccid
This patchset adds three new devices, usb-ccid, ccid-card-passthru and ccid-card-emulated, providing a CCID bus, a simple passthru protocol implementing card requiring a client, and a standalone emulated card. It also introduces a new directory libcaccard with CAC card emulation, CAC is a type of ISO 7816 smart card. Tree for pull: git://anongit.freedesktop.org/~alon/qemu usb_ccid.v20 v19-v20 changes: * checkpatch.pl. Here are the remaining errors with explanation: * ignored 5 macro errors of the type ERROR: Macros with complex values should be enclosed in parenthesis because fixing them breaks current code, if it really bothers someone I can fix it. * four of them are in libcacard/card_7816t.h: /* give the subfields a unified look */ .. #define a_cla a_header-ah_cla /* class */ #define a_ins a_header-ah_ins /* instruction */ #define a_p1 a_header-ah_p1 /* parameter 1 */ #define a_p2 a_header-ah_p2 /* parameter 2 */ * and the fifth: #4946: FILE: libcacard/vcardt.h:31: +#define VCARD_ATR_PREFIX(size) 0x3b, 0x66+(size), 0x00, 0xff, \ + 'V', 'C', 'A', 'R', 'D', '_' * Ignored this warning since I couldn't figure it out, and it's a test file: WARNING: externs should be avoided in .c files #2343: FILE: libcacard/link_test.c:7: +VCardStatus cac_card_init(const char *flags, VCard *card, v18-v19 changes: * more merges, down to a single digit number of patches. * drop enumeration property, use string. * rebased (trivial) v17-v18 changes: * merge vscard_common.h patches. * actually provide a tree to pull. v16-v17 changes: * merged all the v15-v16 patches * merged some more wherever it was easy (all same file commits). * added signed off by to first four patches * ccid.h: added copyright, removed underscore in defines, and replaced non C89 comments v15-v16 changes: * split vscard_common introducing patch for ease of review * sum of commit logs for the v15-v16 commits: (whitespace fixes removed for space, see original commit messages in later patches) * usb-ccid: * fix abort on client answer after card remove * enable migration * remove side affect code from asserts * return consistent self-powered state * mask out reserved bits in ccid_set_parameters * add missing abRFU in SetParameters (no affect on linux guest) * vscard_common.h protocol change: * VSCMsgInit capabilities and magic * removed ReaderResponse, will use Error instead with code==VSC_SUCCESS. * added Flush and FlushComplete, remove Reconnect. * define VSCARD_MAGIC * added error code VSC_SUCCESS. * ccid-card-passthru * return correct size * return error instead of assert if client sent too large ATR * don't assert if client sent too large a size, but add asserts for indices to buffer * reset vscard_in indices on chardev disconnect * handle init from client * error if no chardev supplied * use ntoh, hton * eradicate reader_id_t * remove Reconnect usage (removed from VSCARD protocol) * send VSC_SUCCESS on card insert/remove and reader add/remove * ccid-card-emulated * fix error reporting in initfn v14-v15 changes: * add patch with --enable-smartcard and --disable-smartcard and only disable ccid-card-emulated if nss not found. * add patch with description strings * s/libcaccard/libcacard/ in docs/ccid.txt v13-v14 changes: - support device_del/device_add on ccid-card-* and usb-ccid * usb-ccid: * lose card reference when card device deleted * check slot number and deny adding a slot if one is already added. * ccid-card-*: use qdev_simple_unplug_cb in both emulated and passthru ccid cards, the exitfn already takes care of triggering card removal in the usb dev. * libcacard: * remove double include of config-host.mak * add replay of card events to libcacard to support second and more emulation * don't initialize more then once (doesn't support it right now, so one thread, NSS thread, is left when device_del is done) * add VCARD_EMUL_INIT_ALREADY_INITED * ccid-card-emulated: * take correct mutexes on signaling to fix deadlocks on device_del * allow card insertion/removal event without proper reader insertion event v12-v13 changes: * libcacard: * fix Makefile clean to remove vscclient * fix double include of config-host in Makefile * usb-ccid: remove attach/detach logic, usb is always attached. Guest doesn't care if there is a reader attached with no card anyway. * ccid-card-passthru: don't close chr_dev on removal, makes it possible to use device_del/device_add to create remove/insertion for debugging. v11-v12 changes: * fix out of tree build v10-v11 changes: * fix last patch that removed one of the doc files. * updated flow table in docs/ccid.txt v8-v10 changes: * usb-ccid: * add slot for future use (Gerd) * ifdef ENABLE_MIGRATION for migration support on account of usb migration not being ready in general. (Gerd) * verbosified commit messages.
[Qemu-devel] [PATCH 6/7] ccid: add docs
Add documentation for the usb-ccid device and accompanying two card devices, ccid-card-emulated and ccid-card-passthru. Signed-off-by: Alon Levy al...@redhat.com --- docs/ccid.txt | 135 + 1 files changed, 135 insertions(+), 0 deletions(-) create mode 100644 docs/ccid.txt diff --git a/docs/ccid.txt b/docs/ccid.txt new file mode 100644 index 000..b8e504a --- /dev/null +++ b/docs/ccid.txt @@ -0,0 +1,135 @@ +Qemu CCID Device Documentation. + +Contents +1. USB CCID device +2. Building +3. Using ccid-card-emulated with hardware +4. Using ccid-card-emulated with certificates +5. Using ccid-card-passthru with client side hardware +6. Using ccid-card-passthru with client side certificates +7. Passthrough protocol scenario +8. libcacard + +1. USB CCID device + +The USB CCID device is a USB device implementing the CCID specification, which +lets one connect smart card readers that implement the same spec. For more +information see the specification: + + Universal Serial Bus + Device Class: Smart Card + CCID + Specification for + Integrated Circuit(s) Cards Interface Devices + Revision 1.1 + April 22rd, 2005 + +Smartcard are used for authentication, single sign on, decryption in +public/private schemes and digital signatures. A smartcard reader on the client +cannot be used on a guest with simple usb passthrough since it will then not be +available on the client, possibly locking the computer when it is removed. On +the other hand this device can let you use the smartcard on both the client and +the guest machine. It is also possible to have a completely virtual smart card +reader and smart card (i.e. not backed by a physical device) using this device. + +2. Building + +The cryptographic functions and access to the physical card is done via NSS. + +Installing NSS: + +In redhat/fedora: +yum install nss-devel +In ubuntu/debian: +apt-get install libnss3-dev +(not tested on ubuntu) + +Configuring and building: +./configure --enable-smartcard make + +3. Using ccid-card-emulated with hardware + +Assuming you have a working smartcard on the host with the current +user, using NSS, qemu acts as another NSS client using ccid-card-emulated: + +qemu -usb -device usb-ccid -device ccid-card-emualated + +4. Using ccid-card-emulated with certificates + +You must create the certificates. This is a one time process. We use NSS +certificates: + +certutil -d /etc/pki/nssdb -x -t CT,CT,CT -S -s CN=cert1 -n cert1 + +Note: you must have exactly three certificates. + +Assuming the current user can access the certificates (use certutil -L to +verify), you can use the emulated card type with the certificates backend: + +qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,cert1=cert1,cert2=cert2,cert3=cert3 + +5. Using ccid-card-passthru with client side hardware + +on the host specify the ccid-card-passthru device with a suitable chardev: + +qemu -chardev socket,server,host=0.0.0.0,port=2001,id=ccid,nowait -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid + +on the client run vscclient, built when you built the libcacard library: +libcacard/vscclient qemu-host 2001 + +6. Using ccid-card-passthru with client side certificates + +Run qemu as per #5, and run vscclient as follows: +(Note: vscclient command line interface is in a state of change) + +libcacard/vscclient -e db=\/etc/pki/nssdb\ use_hw=no soft=(,Test,CAC,,cert1,cert2,cert3) qemu-host 2001 + +7. Passthrough protocol scenario + +This is a typical interchange of messages when using the passthru card device. +usb-ccid is a usb device. It defaults to an unattached usb device on startup. +usb-ccid expects a chardev and expects the protocol defined in +cac_card/vscard_common.h to be passed over that. +The usb-ccid device can be in one of three modes: + * detached + * attached with no card + * attached with card + +A typical interchange is: (the arrow shows who started each exchange, it can be client +originated or guest originated) + +client event | vscclient |passthru| usb-ccid | guest event +-- + | VSC_Init|| | + | VSC_ReaderAdd || attach| + | || | sees new usb device. +card inserted - | || | + | VSC_ATR | insert | insert | see new card + | || | + | VSC_APDU| VSC_APDU | | - guest sends APDU +client-physical | || | +card APDU exchange|
Re: [Qemu-devel] Re: [PATCH 10/22] migration: Refactor and simplify error checking in migrate_fd_put_ready
2011/2/23 Juan Quintela quint...@redhat.com: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp wrote: 2011/2/23 Juan Quintela quint...@redhat.com: Signed-off-by: Juan Quintela quint...@redhat.com --- migration.c | 20 +--- 1 files changed, 9 insertions(+), 11 deletions(-) diff --git a/migration.c b/migration.c index f015e02..641df9f 100644 --- a/migration.c +++ b/migration.c @@ -361,28 +361,26 @@ static void migrate_fd_put_ready(void *opaque) DPRINTF(iterate\n); if (qemu_savevm_state_iterate(s-mon, s-file) == 1) { - int state; int old_vm_running = vm_running; DPRINTF(done iterating\n); vm_stop(VMSTOP_MIGRATE); - if ((qemu_savevm_state_complete(s-mon, s-file)) 0) { - if (old_vm_running) { - vm_start(); - } - state = MIG_STATE_ERROR; + if (qemu_savevm_state_complete(s-mon, s-file) 0) { + migrate_fd_error(s); } else { - state = MIG_STATE_COMPLETED; + if (migrate_fd_cleanup(s) 0) { + migrate_fd_error(s); + } else { + s-state = MIG_STATE_COMPLETED; + notifier_list_notify(migration_state_notifiers); + } } - if (migrate_fd_cleanup(s) 0) { + if (s-get_status(s) != MIG_STATE_COMPLETED) { if (old_vm_running) { vm_start(); } This part, although it's not fair to ask you, but calling vm_start when != MIG_STATE_COMPLETED terrifies me because just failing migrate_fd_cleanup (mostly calling qemu_fclose) may cause split brain between src/dst. Although I haven't encountered this situation, just having stopped VMs on both sides is safer. I see your pain. I am not happy at all, but this was integrated by Anthony to fix this bug: commit 41ef56e61153d7bd27d34a634633bb51b1c5988d Author: Anthony Liguori aligu...@us.ibm.com Date: Wed Jun 2 14:55:25 2010 -0500 migration: respect exit status with exec: This fixes https://bugs.launchpad.net/qemu/+bug/391879 Thanks for the link. I don't know IIUC, why stopping the VM was a problem? The essential thing is that we need to introduce a flag that whether user wants to continue a VM when something goes wrong during live migration. Deciding only with old_vm_running is wrong. I think that it fixes that bug, but it makes me un-easy to restart vm if there is a failure in migrate_fd_cleanup(). As I didn't wanted to change behaviour with this series, I left it as it was. I agree with keeping the behavior unchanged. Next on ToDo list is to do something sensible with errors, just now we are not very good at handling them. Yeah. If we introduce Kemari, the migration code becomes more important because it'll be part of the normal VM execution path :) Thanks, Yoshi Later, Juan.
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On 2011-02-23 12:08, Edgar E. Iglesias wrote: On Wed, Feb 23, 2011 at 11:25:54AM +0100, Paolo Bonzini wrote: On 02/23/2011 11:18 AM, Edgar E. Iglesias wrote: Sorry, I don't know the code well enough to give any sensible feedback on patch 2 - 4. I did test them with some of my guests and things seem to be OK with them but quite a bit slower. I saw around 10 - 20% slowdown with a cris guest and -icount 10. The slow down might be related to the issue with super slow icount together with iothread (adressed by Marcelos iothread timeout patch). No, this supersedes Marcelo's patch. 10-20% doesn't seem comparable to looks like it deadlocked anyway. Also, Jan has ideas on how to remove the synchronization overhead in the main loop for TCG+iothread. I see. I tried booting two of my MIPS and CRIS linux guests with iothread and -icount 4. Without your patch, the boot crawls super slow. Your patch gives a huge improvement. This was the deadlock scenario which I mentioned in previous emails. Just to clarify the previous test where I saw slowdown with your patch: A CRIS setup that has a CRIS and basically only two peripherals, a timer block and a device (X) that computes stuff but delays the results with a virtual timer. The guest CPU is 99% of the time just busy-waiting for device X to get ready. This latter test runs in 3.7s with icount 4 and without iothread, with or without your patch. With icount 4 and iothread it runs in ~1m5s without your patch and ~1m20s with your patch. That was the 20% slowdown I mentioned earlier. Don't know if that info helps... You should try to trace the event flow in qemu, either via strace, via the built-in tracer (which likely requires a bit more tracepoints), or via a system-level tracer (ftrace / kernelshark). Did my patches contribute a bit to overhead reduction? They specifically target the costly vcpu/iothread switches in TCG mode (caused by TCGs excessive lock-holding times). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
[Qemu-devel] [PATCH] tools: Use real async.c instead of stubs
It's wrong to call BHs directly, even in tools. The only operations that schedule BHs are called in a loop that (indirectly) contains a call to qemu_bh_poll anyway, so we're not losing the scheduled BHs: Tools either use synchronous functions, which are guaranteed to have completed (including any BHs) when they return; or if they use asynchronous functions, they need to call qemu_aio_wait() or similar functions already today. Signed-off-by: Kevin Wolf kw...@redhat.com --- Makefile.objs |4 ++-- qemu-tool.c | 47 --- 2 files changed, 6 insertions(+), 45 deletions(-) diff --git a/Makefile.objs b/Makefile.objs index c144df1..fc369dc 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -13,7 +13,7 @@ oslib-obj-$(CONFIG_POSIX) += oslib-posix.o ### # block-obj-y is code used by both qemu system emulation and qemu-img -block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o +block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o async.o block-obj-y += nbd.o block.o aio.o aes.o qemu-config.o block-obj-$(CONFIG_POSIX) += posix-aio-compat.o block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o @@ -63,7 +63,7 @@ common-obj-y = $(block-obj-y) blockdev.o common-obj-y += $(net-obj-y) common-obj-y += $(qobject-obj-y) common-obj-$(CONFIG_LINUX) += $(fsdev-obj-$(CONFIG_LINUX)) -common-obj-y += readline.o console.o cursor.o async.o qemu-error.o +common-obj-y += readline.o console.o cursor.o qemu-error.o common-obj-y += $(oslib-obj-y) common-obj-$(CONFIG_WIN32) += os-win32.o common-obj-$(CONFIG_POSIX) += os-posix.o diff --git a/qemu-tool.c b/qemu-tool.c index 392e1c9..d45840d 100644 --- a/qemu-tool.c +++ b/qemu-tool.c @@ -56,53 +56,10 @@ void monitor_print_filename(Monitor *mon, const char *filename) { } -void async_context_push(void) -{ -} - -void async_context_pop(void) -{ -} - -int get_async_context_id(void) -{ -return 0; -} - void monitor_protocol_event(MonitorEvent event, QObject *data) { } -QEMUBH *qemu_bh_new(QEMUBHFunc *cb, void *opaque) -{ -QEMUBH *bh; - -bh = qemu_malloc(sizeof(*bh)); -bh-cb = cb; -bh-opaque = opaque; - -return bh; -} - -int qemu_bh_poll(void) -{ -return 0; -} - -void qemu_bh_schedule(QEMUBH *bh) -{ -bh-cb(bh-opaque); -} - -void qemu_bh_cancel(QEMUBH *bh) -{ -} - -void qemu_bh_delete(QEMUBH *bh) -{ -qemu_free(bh); -} - int qemu_set_fd_handler2(int fd, IOCanReadHandler *fd_read_poll, IOHandler *fd_read, @@ -111,3 +68,7 @@ int qemu_set_fd_handler2(int fd, { return 0; } + +void qemu_notify_event(void) +{ +} -- 1.7.2.3
[Qemu-devel] [PATCH 5/7] ccid: add ccid-card-emulated device
This devices uses libcacard (internal) to emulate a smartcard conforming to the CAC standard. It attaches to the usb-ccid bus. Usage instructions (example command lines) are in the following patch in docs/ccid.txt. It uses libcacard which uses nss, so it can work with both hw cards and certificates (files). Signed-off-by: Alon Levy al...@redhat.com --- changes from v19-v20: * checkpatch.pl changes from v18-v19: * add qdev.desc * backend: drop the enumeration property, back to using a string one. changes from v16-v17: * use PROP_TYPE_ENUM for backend changes from v15-v16: * fix error reporting in initfn * bump copyright year * update copyright license changes from v1: * remove stale comments, use only c-style comments * bugfix, forgot to set recv_len * change reader name to 'Virtual Reader' --- Makefile.objs |2 +- hw/ccid-card-emulated.c | 599 +++ 2 files changed, 600 insertions(+), 1 deletions(-) create mode 100644 hw/ccid-card-emulated.c diff --git a/Makefile.objs b/Makefile.objs index b5e001e..f0933f3 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -197,7 +197,7 @@ hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o hw-obj-$(CONFIG_DMA) += dma.o hw-obj-$(CONFIG_HPET) += hpet.o hw-obj-$(CONFIG_APPLESMC) += applesmc.o -hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o +hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o ccid-card-emulated.o # PPC devices hw-obj-$(CONFIG_OPENPIC) += openpic.o diff --git a/hw/ccid-card-emulated.c b/hw/ccid-card-emulated.c new file mode 100644 index 000..bd84d45 --- /dev/null +++ b/hw/ccid-card-emulated.c @@ -0,0 +1,599 @@ +/* + * CCID Card Device. Emulated card. + * + * Copyright (c) 2011 Red Hat. + * Written by Alon Levy. + * + * This code is licenced under the GNU LGPL, version 2 or later. + */ + +/* + * It can be used to provide access to the local hardware in a non exclusive + * way, or it can use certificates. It requires the usb-ccid bus. + * + * Usage 1: standard, mirror hardware reader+card: + * qemu .. -usb -device usb-ccid -device ccid-card-emulated + * + * Usage 2: use certificates, no hardware required + * one time: create the certificates: + * for i in 1 2 3; do + * certutil -d /etc/pki/nssdb -x -t CT,CT,CT -S -s CN=user$i -n user$i + * done + * qemu .. -usb -device usb-ccid \ + * -device ccid-card-emulated,cert1=user1,cert2=user2,cert3=user3 + * + * If you use a non default db for the certificates you can specify it using + * the db parameter. + */ + +#include pthread.h +#include eventt.h +#include vevent.h +#include vreader.h +#include vcard_emul.h +#include qemu-char.h +#include monitor.h +#include hw/ccid.h + +#define DPRINTF(card, lvl, fmt, ...) \ +do {\ +if (lvl = card-debug) {\ +printf(ccid-card-emul: %s: fmt , __func__, ## __VA_ARGS__);\ +} \ +} while (0) + +#define EMULATED_DEV_NAME ccid-card-emulated + +#define BACKEND_NSS_EMULATED_NAME nss-emulated +#define BACKEND_CERTIFICATES_NAME certificates + +enum { +BACKEND_NSS_EMULATED = 1, +BACKEND_CERTIFICATES +}; + +#define DEFAULT_BACKEND BACKEND_NSS_EMULATED + +typedef struct EmulatedState EmulatedState; + +enum { +EMUL_READER_INSERT = 0, +EMUL_READER_REMOVE, +EMUL_CARD_INSERT, +EMUL_CARD_REMOVE, +EMUL_GUEST_APDU, +EMUL_RESPONSE_APDU, +EMUL_ERROR, +}; + +static const char *emul_event_to_string(uint32_t emul_event) +{ +switch (emul_event) { +case EMUL_READER_INSERT: return EMUL_READER_INSERT; +case EMUL_READER_REMOVE: return EMUL_READER_REMOVE; +case EMUL_CARD_INSERT: return EMUL_CARD_INSERT; +case EMUL_CARD_REMOVE: return EMUL_CARD_REMOVE; +case EMUL_GUEST_APDU: return EMUL_GUEST_APDU; +case EMUL_RESPONSE_APDU: return EMUL_RESPONSE_APDU; +case EMUL_ERROR: return EMUL_ERROR; +default: +break; +} +return UNKNOWN; +} + +typedef struct EmulEvent { +QSIMPLEQ_ENTRY(EmulEvent) entry; +union { +struct { +uint32_t type; +} gen; +struct { +uint32_t type; +uint64_t code; +} error; +struct { +uint32_t type; +uint32_t len; +uint8_t data[]; +} data; +} p; +} EmulEvent; + +#define MAX_ATR_SIZE 40 +struct EmulatedState { +CCIDCardState base; +uint8_t debug; +char*backend_str; +uint32_t backend; +char*cert1; +char*cert2; +char*cert3; +char*db; +uint8_t atr[MAX_ATR_SIZE]; +uint8_t atr_length; +QSIMPLEQ_HEAD(event_list, EmulEvent) event_list; +pthread_mutex_t event_list_mutex; +VReader *reader; +QSIMPLEQ_HEAD(guest_apdu_list, EmulEvent) guest_apdu_list; +pthread_mutex_t vreader_mutex; /* and guest_apdu_list mutex */ +pthread_mutex_t handle_apdu_mutex; +pthread_cond_t handle_apdu_cond; +int pipe[2]; +int quit_apdu_thread; +pthread_mutex_t apdu_thread_quit_mutex; +
[Qemu-devel] [PATCH 7/7] ccid: configure: improve --enable-smartcard flags
* add --enable-smartcard and --disable-smartcard flags * let the nss check only disable building the ccid-card-emulated device * report only if nss is found or not, not smartcard build inclusion * don't link with NSS if --disable-smartcard-nss Signed-off-by: Alon Levy al...@redhat.com --- Makefile.objs |3 ++- Makefile.target |2 +- configure | 55 ++- 3 files changed, 45 insertions(+), 15 deletions(-) diff --git a/Makefile.objs b/Makefile.objs index f0933f3..bba862c 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -197,7 +197,8 @@ hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o hw-obj-$(CONFIG_DMA) += dma.o hw-obj-$(CONFIG_HPET) += hpet.o hw-obj-$(CONFIG_APPLESMC) += applesmc.o -hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o ccid-card-emulated.o +hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o +hw-obj-$(CONFIG_SMARTCARD_NSS) += ccid-card-emulated.o # PPC devices hw-obj-$(CONFIG_OPENPIC) += openpic.o diff --git a/Makefile.target b/Makefile.target index 1b744be..0951ba0 100644 --- a/Makefile.target +++ b/Makefile.target @@ -340,7 +340,7 @@ obj-y += $(addprefix $(HWDIR)/, $(hw-obj-y)) endif # CONFIG_SOFTMMU -obj-y += $(addprefix ../libcacard/, $(libcacard-$(CONFIG_SMARTCARD))) +obj-y += $(addprefix ../libcacard/, $(libcacard-$(CONFIG_SMARTCARD_NSS))) obj-y += $(addprefix ../, $(trace-obj-y)) obj-$(CONFIG_GDBSTUB_XML) += gdbstub-xml.o diff --git a/configure b/configure index e2d0e9d..4f19613 100755 --- a/configure +++ b/configure @@ -174,7 +174,8 @@ trace_backend=nop trace_file=trace spice= rbd= -smartcard=yes +smartcard= +smartcard_nss= # parse CC options first for opt do @@ -720,6 +721,14 @@ for opt do ;; --enable-rbd) rbd=yes ;; + --disable-smartcard) smartcard=no + ;; + --enable-smartcard) smartcard=yes + ;; + --disable-smartcard-nss) smartcard_nss=no + ;; + --enable-smartcard-nss) smartcard_nss=yes + ;; *) echo ERROR: unknown option $opt; show_help=yes ;; esac @@ -915,6 +924,10 @@ echoDefault:trace-pid echo --disable-spice disable spice echo --enable-spice enable spice echo --enable-rbd enable building the rados block device (rbd) +echo --disable-smartcard disable smartcard support +echo --enable-smartcard enable smartcard support +echo --disable-smartcard-nss disable smartcard nss support +echo --enable-smartcard-nss enable smartcard nss support echo echo NOTE: The object files are built at the place where configure is launched exit 1 @@ -2292,16 +2305,28 @@ EOF fi # check for libcacard for smartcard support -smartcard_cflags=-I\$(SRC_PATH)/libcacard -libcacard_libs=$($pkgconfig --libs nss 2/dev/null) -libcacard_cflags=$($pkgconfig --cflags nss) -# TODO - what's the minimal nss version we support? -if $pkgconfig --atleast-version=3.12.8 nss; then +if test $smartcard != no ; then smartcard=yes -QEMU_CFLAGS=$QEMU_CFLAGS $smartcard_cflags $libcacard_cflags -LIBS=$libcacard_libs $LIBS -else -smartcard=no +smartcard_cflags= +# TODO - what's the minimal nss version we support? +if test $smartcard_nss != no; then +if $pkg_config --atleast-version=3.12.8 nss /dev/null 21 ; then +smartcard_nss=yes +smartcard_cflags=-I\$(SRC_PATH)/libcacard +libcacard_libs=$($pkg_config --libs nss 2/dev/null) +libcacard_cflags=$($pkg_config --cflags nss 2/dev/null) +QEMU_CFLAGS=$QEMU_CFLAGS $smartcard_cflags $libcacard_cflags +LIBS=$libcacard_libs $LIBS +else +if test $smartcard_nss == yes; then +feature_not_found nss +fi +smartcard_nss=no +fi +fi +fi +if test $smartcard == no ; then +smartcard_nss=no fi ## @@ -2537,7 +2562,7 @@ echo Trace output file $trace_file-pid echo spice support $spice echo rbd support $rbd echo xfsctl support$xfs -echo smartcard support $smartcard +echo nss used $smartcard_nss if test $sdl_too_old = yes; then echo - Your SDL version is too old - please upgrade to have SDL support @@ -2823,6 +2848,10 @@ if test $smartcard = yes ; then echo CONFIG_SMARTCARD=y $config_host_mak fi +if test $smartcard_nss = yes ; then + echo CONFIG_SMARTCARD_NSS=y $config_host_mak +fi + # XXX: suppress that if [ $bsd = yes ] ; then echo CONFIG_BSD=y $config_host_mak @@ -3165,7 +3194,7 @@ fi if test $target_darwin_user = yes ; then echo CONFIG_DARWIN_USER=y $config_target_mak fi -if test $smartcard = yes ; then +if test $smartcard_nss = yes ; then echo subdir-$target: subdir-libcacard $config_host_mak echo libcacard_libs=$libcacard_libs $config_host_mak echo libcacard_cflags=$libcacard_cflags $config_host_mak @@ -3383,7 +3412,7 @@ for hwlib in 32 64; do echo
[Qemu-devel] [PATCH 3/7] ccid: add passthru card device
The passthru ccid card is a device sitting on the usb-ccid bus and using a chardevice to communicate with a remote device using the VSCard protocol defined in libcacard/vscard_common.h Usage docs available in following patch in docs/ccid.txt Signed-off-by: Alon Levy al...@redhat.com --- Changes from v19-v20: * checkpatch.pl Changes from v18-v19: * add qdev.desc * remove .qdev.unplug (no hot unplug support for ccid bus) Changes from v16-v17: * fix wrong cast when receiving VSC_Error * ccid-card-passthru: force chardev user wakeup by sending Init see lengthy comment below. Changes from v15-v16: Behavioral changes: * return correct size * return error instead of assert if client sent too large ATR * don't assert if client sent too large a size, but add asserts for indices to buffer * reset vscard_in indices on chardev disconnect * handle init from client * error if no chardev supplied * use ntoh, hton * eradicate reader_id_t * remove Reconnect usage (removed from VSCARD protocol) * send VSC_SUCCESS on card insert/remove and reader add/remove Style fixes: * width of line fix * update copyright * remove old TODO's * update file header comment * use macros for debug levels * c++ style comment replacement * update copyright license * fix ATR size comment * fix whitespace in struct def * fix DPRINTF prefix * line width fix ccid-card-passthru: force chardev user wakeup by sending Init The problem: how to wakeup the user of the smartcard when the smartcard device is initialized? Long term solution: have a callback interface. This was done via the deprecated so called chardev ioctl interface. Short term solution: do a write. Specifically we write an Init message. And we change the client to send it's own Init message regardless of receiving this one. Additional Init messages will be regarded as acceptable, the first one received after connection establishment is the determining one wrt capabilities. --- Makefile.objs |2 +- hw/ccid-card-passthru.c | 337 +++ 2 files changed, 338 insertions(+), 1 deletions(-) create mode 100644 hw/ccid-card-passthru.c diff --git a/Makefile.objs b/Makefile.objs index 414d206..a3ac45a 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -197,7 +197,7 @@ hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o hw-obj-$(CONFIG_DMA) += dma.o hw-obj-$(CONFIG_HPET) += hpet.o hw-obj-$(CONFIG_APPLESMC) += applesmc.o -hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o +hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o # PPC devices hw-obj-$(CONFIG_OPENPIC) += openpic.o diff --git a/hw/ccid-card-passthru.c b/hw/ccid-card-passthru.c new file mode 100644 index 000..ad2adb7 --- /dev/null +++ b/hw/ccid-card-passthru.c @@ -0,0 +1,337 @@ +/* + * CCID Passthru Card Device emulation + * + * Copyright (c) 2011 Red Hat. + * Written by Alon Levy. + * + * This code is licenced under the GNU LGPL, version 2 or later. + */ + +#include arpa/inet.h + +#include qemu-char.h +#include monitor.h +#include hw/ccid.h +#include libcacard/vscard_common.h + +#define DPRINTF(card, lvl, fmt, ...)\ +do {\ +if (lvl = card-debug) { \ +printf(ccid-card-passthru: fmt , ## __VA_ARGS__); \ +} \ +} while (0) + +#define D_WARN 1 +#define D_INFO 2 +#define D_MORE_INFO 3 +#define D_VERBOSE 4 + +/* TODO: do we still need this? */ +uint8_t DEFAULT_ATR[] = { +/* From some example somewhere + 0x3B, 0xB0, 0x18, 0x00, 0xD1, 0x81, 0x05, 0xB1, 0x40, 0x38, 0x1F, 0x03, 0x28 + */ + +/* From an Athena smart card */ + 0x3B, 0xD5, 0x18, 0xFF, 0x80, 0x91, 0xFE, 0x1F, 0xC3, 0x80, 0x73, 0xC8, 0x21, + 0x13, 0x08 +}; + + +#define PASSTHRU_DEV_NAME ccid-card-passthru +#define VSCARD_IN_SIZE 65536 + +/* maximum size of ATR - from 7816-3 */ +#define MAX_ATR_SIZE40 + +typedef struct PassthruState PassthruState; + +struct PassthruState { +CCIDCardState base; +CharDriverState *cs; +uint8_t vscard_in_data[VSCARD_IN_SIZE]; +uint32_t vscard_in_pos; +uint32_t vscard_in_hdr; +uint8_t atr[MAX_ATR_SIZE]; +uint8_t atr_length; +uint8_t debug; +}; + +/* VSCard protocol over chardev + * This code should not depend on the card type. + * */ + +static void ccid_card_vscard_send_msg(PassthruState *s, +VSCMsgType type, uint32_t reader_id, +const uint8_t *payload, uint32_t length) +{ +VSCMsgHeader scr_msg_header; + +scr_msg_header.type = htonl(type); +scr_msg_header.reader_id = htonl(reader_id); +scr_msg_header.length = htonl(length); +qemu_chr_write(s-cs, (uint8_t *)scr_msg_header, sizeof(VSCMsgHeader)); +qemu_chr_write(s-cs, payload, length); +} + +static void ccid_card_vscard_send_apdu(PassthruState *s, +const uint8_t *apdu, uint32_t length) +{ +ccid_card_vscard_send_msg( +s, VSC_APDU,
[Qemu-devel] [PATCH 1/7] usb-ccid: add CCID bus
A CCID device is a smart card reader. It is a USB device, defined at [1]. This patch introduces the usb-ccid device that is a ccid bus. Next patches will introduce two card types to use it, a passthru card and an emulated card. [1] http://www.usb.org/developers/devclass_docs/DWG_Smart-Card_CCID_Rev110. Signed-off-by: Alon Levy al...@redhat.com --- changes from v19-v20: * checkpatch.pl changes from v18-v19: * merged: ccid.h: add copyright, fix define and remove non C89 comments * add qdev.desc changes from v15-v16: Behavioral changes: * fix abort on client answer after card remove * enable migration * remove side affect code from asserts * return consistent self-powered state * mask out reserved bits in ccid_set_parameters * add missing abRFU in SetParameters (no affect on linux guest) whitefixes / comments / consts defines: * remove stale comment * remove ccid_print_pending_answers if no DEBUG_CCID * replace printf's with DPRINTF, remove DEBUG_CCID, add verbosity defines * use error_report * update copyright (most of the code is not original) * reword known bug comment * add missing closing quote in comment * add missing whitespace on one line * s/CCID_SetParameter/CCID_SetParameters/ * add comments * use define for max packet size Comment for return consistent self-powered state: the Configuration Descriptor bmAttributes claims we are self powered, but we were returning not self powered to USB_REQ_GET_STATUS control message. In practice, this message is not sent by a linux 2.6.35.10-74.fc14.x86_64 guest (not tested on other guests), unless you issue lsusb -v as root (for example). --- Makefile.objs |1 + configure |6 + hw/ccid.h | 54 +++ hw/usb-ccid.c | 1391 + 4 files changed, 1452 insertions(+), 0 deletions(-) create mode 100644 hw/ccid.h create mode 100644 hw/usb-ccid.c diff --git a/Makefile.objs b/Makefile.objs index c144df1..414d206 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -197,6 +197,7 @@ hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o hw-obj-$(CONFIG_DMA) += dma.o hw-obj-$(CONFIG_HPET) += hpet.o hw-obj-$(CONFIG_APPLESMC) += applesmc.o +hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o # PPC devices hw-obj-$(CONFIG_OPENPIC) += openpic.o diff --git a/configure b/configure index 791b71d..147aab3 100755 --- a/configure +++ b/configure @@ -174,6 +174,7 @@ trace_backend=nop trace_file=trace spice= rbd= +smartcard=yes # parse CC options first for opt do @@ -2523,6 +2524,7 @@ echo Trace output file $trace_file-pid echo spice support $spice echo rbd support $rbd echo xfsctl support$xfs +echo smartcard support $smartcard if test $sdl_too_old = yes; then echo - Your SDL version is too old - please upgrade to have SDL support @@ -2804,6 +2806,10 @@ if test $spice = yes ; then echo CONFIG_SPICE=y $config_host_mak fi +if test $smartcard = yes ; then + echo CONFIG_SMARTCARD=y $config_host_mak +fi + # XXX: suppress that if [ $bsd = yes ] ; then echo CONFIG_BSD=y $config_host_mak diff --git a/hw/ccid.h b/hw/ccid.h new file mode 100644 index 000..4350bc2 --- /dev/null +++ b/hw/ccid.h @@ -0,0 +1,54 @@ +/* + * CCID Passthru Card Device emulation + * + * Copyright (c) 2011 Red Hat. + * Written by Alon Levy. + * + * This code is licenced under the GNU LGPL, version 2 or later. + */ + +#ifndef CCID_H +#define CCID_H + +#include qdev.h + +typedef struct CCIDCardState CCIDCardState; +typedef struct CCIDCardInfo CCIDCardInfo; + +/* state of the CCID Card device (i.e. hw/ccid-card-*.c) + */ +struct CCIDCardState { +DeviceState qdev; +uint32_tslot; /* For future use with multiple slot reader. */ +}; + +/* callbacks to be used by the CCID device (hw/usb-ccid.c) to call + * into the smartcard device (hw/ccid-card-*.c) + */ +struct CCIDCardInfo { +DeviceInfo qdev; +void (*print)(Monitor *mon, CCIDCardState *card, int indent); +const uint8_t *(*get_atr)(CCIDCardState *card, uint32_t *len); +void (*apdu_from_guest)(CCIDCardState *card, +const uint8_t *apdu, +uint32_t len); +int (*exitfn)(CCIDCardState *card); +int (*initfn)(CCIDCardState *card); +}; + +/* API for smartcard calling the CCID device (used by hw/ccid-card-*.c) + */ +void ccid_card_send_apdu_to_guest(CCIDCardState *card, + uint8_t *apdu, + uint32_t len); +void ccid_card_card_removed(CCIDCardState *card); +void ccid_card_card_inserted(CCIDCardState *card); +void ccid_card_card_error(CCIDCardState *card, uint64_t error); +void ccid_card_qdev_register(CCIDCardInfo *card); + +/* support guest visible insertion/removal of ccid devices based on actual + * devices connected/removed. Called by card implementation (passthru, local) */ +int ccid_card_ccid_attach(CCIDCardState *card); +void ccid_card_ccid_detach(CCIDCardState *card); + +#endif /* CCID_H */ diff --git
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On Wed, Feb 23, 2011 at 12:39:52PM +0100, Jan Kiszka wrote: On 2011-02-23 12:08, Edgar E. Iglesias wrote: On Wed, Feb 23, 2011 at 11:25:54AM +0100, Paolo Bonzini wrote: On 02/23/2011 11:18 AM, Edgar E. Iglesias wrote: Sorry, I don't know the code well enough to give any sensible feedback on patch 2 - 4. I did test them with some of my guests and things seem to be OK with them but quite a bit slower. I saw around 10 - 20% slowdown with a cris guest and -icount 10. The slow down might be related to the issue with super slow icount together with iothread (adressed by Marcelos iothread timeout patch). No, this supersedes Marcelo's patch. 10-20% doesn't seem comparable to looks like it deadlocked anyway. Also, Jan has ideas on how to remove the synchronization overhead in the main loop for TCG+iothread. I see. I tried booting two of my MIPS and CRIS linux guests with iothread and -icount 4. Without your patch, the boot crawls super slow. Your patch gives a huge improvement. This was the deadlock scenario which I mentioned in previous emails. Just to clarify the previous test where I saw slowdown with your patch: A CRIS setup that has a CRIS and basically only two peripherals, a timer block and a device (X) that computes stuff but delays the results with a virtual timer. The guest CPU is 99% of the time just busy-waiting for device X to get ready. This latter test runs in 3.7s with icount 4 and without iothread, with or without your patch. With icount 4 and iothread it runs in ~1m5s without your patch and ~1m20s with your patch. That was the 20% slowdown I mentioned earlier. Don't know if that info helps... You should try to trace the event flow in qemu, either via strace, via the built-in tracer (which likely requires a bit more tracepoints), or via a system-level tracer (ftrace / kernelshark). Thanks, I'll see if I can get some time to run this more carefully during some weekend. Did my patches contribute a bit to overhead reduction? They specifically target the costly vcpu/iothread switches in TCG mode (caused by TCGs excessive lock-holding times). Do you have a tree for quick access to your patches? (couldnt find them on my inbox). I could give them a quick go and post results. Cheers
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On 02/23/2011 12:08 PM, Edgar E. Iglesias wrote: No, this supersedes Marcelo's patch. 10-20% doesn't seem comparable to looks like it deadlocked anyway. Also, Jan has ideas on how to remove the synchronization overhead in the main loop for TCG+iothread. I see. I tried booting two of my MIPS and CRIS linux guests with iothread and -icount 4. Without your patch, the boot crawls super slow. Your patch gives a huge improvement. This was the deadlock scenario which I mentioned in previous emails. Just to clarify the previous test where I saw slowdown with your patch: A CRIS setup that has a CRIS and basically only two peripherals, a timer block and a device (X) that computes stuff but delays the results with a virtual timer. The guest CPU is 99% of the time just busy-waiting for device X to get ready. This latter test runs in 3.7s with icount 4 and without iothread, with or without your patch. Thanks for testing this. With icount 4 and iothread it runs in ~1m5s without your patch and ~1m20s with your patch. That was the 20% slowdown I mentioned earlier. Ok, so it is in both cases with iothread. We go from 16x slowdown to 19x on one testcase :) and huge improvement on another. (Also, the CRIS images on qemu.org simply hang for me without my patch and numeric icount---and the watchdog triggers---so that's another factor in favor of the patches). I guess we can live with the slowdown for now, if somebody else finds the patch okay. Do you have images for the slow test? Paolo
[Qemu-devel] Re: [PATCH 0/4] Improve -icount, fix it with iothread
On 2011-02-23 13:40, Edgar E. Iglesias wrote: On Wed, Feb 23, 2011 at 12:39:52PM +0100, Jan Kiszka wrote: On 2011-02-23 12:08, Edgar E. Iglesias wrote: On Wed, Feb 23, 2011 at 11:25:54AM +0100, Paolo Bonzini wrote: On 02/23/2011 11:18 AM, Edgar E. Iglesias wrote: Sorry, I don't know the code well enough to give any sensible feedback on patch 2 - 4. I did test them with some of my guests and things seem to be OK with them but quite a bit slower. I saw around 10 - 20% slowdown with a cris guest and -icount 10. The slow down might be related to the issue with super slow icount together with iothread (adressed by Marcelos iothread timeout patch). No, this supersedes Marcelo's patch. 10-20% doesn't seem comparable to looks like it deadlocked anyway. Also, Jan has ideas on how to remove the synchronization overhead in the main loop for TCG+iothread. I see. I tried booting two of my MIPS and CRIS linux guests with iothread and -icount 4. Without your patch, the boot crawls super slow. Your patch gives a huge improvement. This was the deadlock scenario which I mentioned in previous emails. Just to clarify the previous test where I saw slowdown with your patch: A CRIS setup that has a CRIS and basically only two peripherals, a timer block and a device (X) that computes stuff but delays the results with a virtual timer. The guest CPU is 99% of the time just busy-waiting for device X to get ready. This latter test runs in 3.7s with icount 4 and without iothread, with or without your patch. With icount 4 and iothread it runs in ~1m5s without your patch and ~1m20s with your patch. That was the 20% slowdown I mentioned earlier. Don't know if that info helps... You should try to trace the event flow in qemu, either via strace, via the built-in tracer (which likely requires a bit more tracepoints), or via a system-level tracer (ftrace / kernelshark). Thanks, I'll see if I can get some time to run this more carefully during some weekend. Did my patches contribute a bit to overhead reduction? They specifically target the costly vcpu/iothread switches in TCG mode (caused by TCGs excessive lock-holding times). Do you have a tree for quick access to your patches? (couldnt find them on my inbox). http://thread.gmane.org/gmane.comp.emulators.qemu/93765 (looks like I failed to CC you) and they are also part of git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream I could give them a quick go and post results. Cheers Thanks, Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
On 02/22/2011 11:11 PM, Anthony Liguori wrote: The commit file is considered reliable storage for the result of image switch operation. Think of the following scenario: - mgmt app requests live copy of guests ide1-hd0 from /a/image.img to /b/image.img. - mgmt app dies. - guest switches to /b/image.img, /a/image.img is outdated. - guest dies. Notifying the switch via QMP would not be reliable in this case. But this is true of any number of operations in QEMU such as hotplug where if a management tool dies after requesting hotplug and then the guest dies before the management tool reconnects, it's never known whether it's truly connected or not. The only way to robustly solve this is for QEMU to maintain a stateful config file. Or, the management tool can remember that it tries to hotplug, reconnect, query qemu (or just retry), and proceed according to the result: - insert record in config for the new device, status = inactive - commit - tell qemu to hotplug crash - restart, read database, see inactive device - query qemu - if device is there, mark as active, commit - else, tell qemu to hotplug, when successful, mark as active, commit A one-off for this particular command doesn't seem wise to me. Strongly agreed. -- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH] tools: Use real async.c instead of stubs
On Wed, Feb 23, 2011 at 11:54 AM, Kevin Wolf kw...@redhat.com wrote: It's wrong to call BHs directly, even in tools. The only operations that schedule BHs are called in a loop that (indirectly) contains a call to qemu_bh_poll anyway, so we're not losing the scheduled BHs: Tools either use synchronous functions, which are guaranteed to have completed (including any BHs) when they return; or if they use asynchronous functions, they need to call qemu_aio_wait() or similar functions already today. Signed-off-by: Kevin Wolf kw...@redhat.com --- Makefile.objs | 4 ++-- qemu-tool.c | 47 --- 2 files changed, 6 insertions(+), 45 deletions(-) We discussed this on IRC and I think it makes sense. Review from others would be appreciated. Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On 2011-02-23 07:38, Gerhard Wiesinger wrote: On Mon, 21 Feb 2011, Arnd Bergmann wrote: On Monday 21 February 2011, Jan Kiszka wrote: Now I think I tried all useful possible combinations: 1.) macvtap0 and macvlan0 in bridged and non bridge mode 2.) macvlan0 based on eth0 or based on macvtap0 3.) Using and ip address on macvlan0 or not (tried both for 7b mac and 7c mac) 4.) Using 7b and 7c mac address in KVM (MAC in KVM is always the command line configured) and used tap interface from macvtap0 (as no second tap devices shows up) Summary: 1.) Using macvtap0 only with 7b mac on interface and also at qemu works well in bridged mode as well as non bridged mode but with limitation no guest/host communication possible, used interface in KVM is tap interface created by macvtap0. Quite logically that it doesn't work with guest/host because of missing hairpin mode on the switch. But according to http://virt.kernelnewbies.org/MacVTap bridge mode should work even without hairpinning available on the switch. 2.) macvlan0 doesn't create any tap interface therefore it can't be used as a device on KVM. 3.) Using 7c mac address in KVM doesn't work at all regardsless of setting ip addresses on macvlan0 or any other setup. @Arnd: Can you explain a setup in detail which should work also with host/guest communication. Thnx. Any further ideas? Which kernel is needed to work? (current: 2.6.34.7-56.fc13.x86_64) Actually, I tried this successfully over a 2.6.38-rcX, but I have no idea what changes related to macvlan/tap went in since .34 and if this is required for this. We had a few bugs in .34, but it should work in general. One thing to make sure is that the host has connectivity through the macvlan interface and the lower interface (eth0) has no IP address assigned but is 'up'. It could also be a bug in the NIC, you could try to set the NIC into promiscious mode using ethtool to work around that. Hello, After some further tests and looking at the iproute ip and kernel code I finally gave up because I thing such a setup it is not possible without breaking up/reconfiguring eth0. When I have to reconfigure eth0 I think a better approach is to configure a bridge which I finally did and works well. I tried to explain/document the macvtap/macvlan concepts and limitations below. Please comment on it whether this is true or false. macvtap/macvlan driver concepts and limitations: 1.) macvlan driver adds a MAC address to a lower interface device where the actual macvlanx device is based on 2.) macvtap driver is based on macvlan driver and macvtap driver adds additional functionality of interface = external program communication with stdin/stdout channel. 3.) Limitations: macvtap/macvlan based devices can only communicate with childs based on the same lower device (e.g. eth0 in this sample) but not to the lower device itself, only to the outside world of the interface (I guess limitations are derived from function macvlan_queue_xmit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/net/macvlan.c;h=6ed577b065df136a18af5ee483810773ca4f43f8;hb=HEAD#l217) Example: # Lower device eth0 must not have an IP address assigned, but when also no communication is possible with macv* devices /root/download/iproute2/iproute2-2.6.37/ip/ip link add link eth0 name macvtap0 type macvtap mode bridge /root/download/iproute2/iproute2-2.6.37/ip/ip link set macvtap0 address 1a:46:0b:ca:bc:7b up /root/download/iproute2/iproute2-2.6.37/ip/ip link add link eth0 name macvtap1 type macvtap mode bridge /root/download/iproute2/iproute2-2.6.37/ip/ip link set macvtap1 address 1a:46:0b:ca:bc:7c up /root/download/iproute2/iproute2-2.6.37/ip/ip link add link eth0 name macvlan0 type macvtap mode bridge /root/download/iproute2/iproute2-2.6.37/ip/ip link set macvlan0 address 1a:46:0b:ca:bc:7d up Example usage: eth0 should not be assigned any IP. macvlan0 should be used as the main host interface and the main IP address is configured. macvtap0 and macvtap1 can be used in e.g. VMs. Bridge communication is possible between the 3 interfaces and the outer world: macvtap0, macvtap1, macvlan0 E.g.: 1.) macvtap0 = macvtap1, macvlan0, eth0 external 2.) macvtap1 = macvtap0, macvlan0, eth0 external 3.) macvlan0 = macvtap0, macvtap1, eth0 external But no communication is possible between lower device eth0 and the child interfaces macvtap0, macvtap1, macvlan0: 1.) eth0 = macvtap0 2.) eth0 = macvtap1 3.) eth0 = macvlan0 Right, but if I set IP(eth0) == IP(macvlan0), I'm able to communicate between macvlan0 and mactapX, thus between guest and host. Just re-checked here, still works (after resolving the usual MAC address mess I caused by configuring manually). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
On 02/23/2011 01:14 AM, Anthony Liguori wrote: -drive already ties into the qemuopts infrastructure and we have readconfig and writeconfig. I don't think we're missing any major pieces to do this in a more proper fashion. The problem with qemu config files is that it splits the authoritative source of where images are stored into two. Is it in the management tool's database or is it in qemu's config file? For the problem at hand, one solution is to make qemu stop after the copy, and then management can issue an additional command to rearrange the disk and resume the guest. A drawback here is that if management dies, the guest is stopped until it restarts. We also make management latency guest visible, even if it doesn't die at an inconvenient place. An alternative approach is to have the copy be performed by a new layered block format driver: - create a new image, type = live-copy, containing three pieces of information - source image - destination image - copy state (initially nothing is copied) - tell qemu switch to the new image - qemu starts copying, updates copy state as needed - copy finishes, event is emitted; reads and writes still serviced - management receives event, switches qemu to destination image - management removes live-copy image If management dies while this is happening, it can simply query the state of the copy. Similarly, if qemu dies, the copy state is persistent (could be 0/1 or real range of blocks). -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
Chunqiang Tang ct...@us.ibm.com writes: [...] Now let’s talk about features. It seems that there is great interest in QCOW2’ internal snapshot feature. If we really want to do that, the right Great interest? Its use cases are demo, debugging, testing and such. Kind of useful for developers, but I wouldn't want to use it in anger. Nice to have if we can get it cheaply, but I'm not prepared to pay much for it in performance or complexity, and I doubt I'm the only one. Users always say yes when you ask them whether they need some feature. Hence, the question is useless. A better question to ask is how much are you willing to pay for it? solution is to follow VMDK’s approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the reference count table. VMDK’s approach can be easily implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. I know the snapshots are not really “internal” as stored in a single file but instead more like external snapshots, but users don’t care about that so long as they support the same use cases. Probably many people who use VMware don't even know that the snapshots are stored as separate files. Do they care? I certainly wouldn't.
Re: [Qemu-devel] [PATCH] target-arm: Implement a minimal set of cp14 debug registers
On 22 February 2011 18:19, Peter Maydell peter.mayd...@linaro.org wrote: + tmp = tcg_const_i32(0); + store_reg(s, rt, tmp); This generates spurious resource leak warnings, because store_reg() calls dead_tmp() so you can't store anything you didn't get from new_tmp(). I think I'll do what Aurelien suggested and put together a patch that implements the resource-leak debug as a proper facility at the tcg level. -- PMM
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. Compression perhaps not, but if you choose compression, then performance is not your top consideration. That's the case with filesystems that support compression as well. -- error compiling committee.c: too many arguments to function
[Qemu-devel] [PATCH 03/18] Introduce skip_header parameter to qemu_loadvm_state().
Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration.c |2 +- savevm.c| 24 +--- sysemu.h|2 +- 3 files changed, 15 insertions(+), 13 deletions(-) diff --git a/migration.c b/migration.c index 302b8fe..bd51fef 100644 --- a/migration.c +++ b/migration.c @@ -63,7 +63,7 @@ int qemu_start_incoming_migration(const char *uri) void process_incoming_migration(QEMUFile *f) { -if (qemu_loadvm_state(f) 0) { +if (qemu_loadvm_state(f, 0) 0) { fprintf(stderr, load of migration failed\n); exit(0); } diff --git a/savevm.c b/savevm.c index 22010b9..52d5be8 100644 --- a/savevm.c +++ b/savevm.c @@ -1716,7 +1716,7 @@ typedef struct LoadStateEntry { int version_id; } LoadStateEntry; -int qemu_loadvm_state(QEMUFile *f) +int qemu_loadvm_state(QEMUFile *f, int skip_header) { QLIST_HEAD(, LoadStateEntry) loadvm_handlers = QLIST_HEAD_INITIALIZER(loadvm_handlers); @@ -1729,17 +1729,19 @@ int qemu_loadvm_state(QEMUFile *f) return -EINVAL; } -v = qemu_get_be32(f); -if (v != QEMU_VM_FILE_MAGIC) -return -EINVAL; +if (!skip_header) { +v = qemu_get_be32(f); +if (v != QEMU_VM_FILE_MAGIC) +return -EINVAL; -v = qemu_get_be32(f); -if (v == QEMU_VM_FILE_VERSION_COMPAT) { -fprintf(stderr, SaveVM v2 format is obsolete and don't work anymore\n); -return -ENOTSUP; +v = qemu_get_be32(f); +if (v == QEMU_VM_FILE_VERSION_COMPAT) { +fprintf(stderr, SaveVM v2 format is obsolete and don't work anymore\n); +return -ENOTSUP; +} +if (v != QEMU_VM_FILE_VERSION) +return -ENOTSUP; } -if (v != QEMU_VM_FILE_VERSION) -return -ENOTSUP; while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) { uint32_t instance_id, version_id, section_id; @@ -2062,7 +2064,7 @@ int load_vmstate(const char *name) return -EINVAL; } -ret = qemu_loadvm_state(f); +ret = qemu_loadvm_state(f, 0); qemu_fclose(f); if (ret 0) { diff --git a/sysemu.h b/sysemu.h index 0a83ab9..8339eb4 100644 --- a/sysemu.h +++ b/sysemu.h @@ -93,7 +93,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f); int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); -int qemu_loadvm_state(QEMUFile *f); +int qemu_loadvm_state(QEMUFile *f, int skip_header); /* SLIRP */ void do_info_slirp(Monitor *mon); -- 1.7.1.2
[Qemu-devel] [PATCH 01/18] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
Currently buf size is fixed at 32KB. It would be useful if it could be flexible. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/hw.h |2 ++ savevm.c | 20 +++- 2 files changed, 21 insertions(+), 1 deletions(-) diff --git a/hw/hw.h b/hw/hw.h index 5e24329..a168a37 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f); int qemu_fclose(QEMUFile *f); void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size); void qemu_put_byte(QEMUFile *f, int v); +void *qemu_realloc_buffer(QEMUFile *f, int size); +void qemu_clear_buffer(QEMUFile *f); static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v) { diff --git a/savevm.c b/savevm.c index a50fd31..22010b9 100644 --- a/savevm.c +++ b/savevm.c @@ -171,7 +171,8 @@ struct QEMUFile { when reading */ int buf_index; int buf_size; /* 0 when writing */ -uint8_t buf[IO_BUF_SIZE]; +int buf_max_size; +uint8_t *buf; int has_error; }; @@ -422,6 +423,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, f-get_rate_limit = get_rate_limit; f-is_write = 0; +f-buf_max_size = IO_BUF_SIZE; +f-buf = qemu_malloc(sizeof(uint8_t) * f-buf_max_size); + return f; } @@ -452,6 +456,19 @@ void qemu_fflush(QEMUFile *f) } } +void *qemu_realloc_buffer(QEMUFile *f, int size) +{ +f-buf_max_size = size; +f-buf = qemu_realloc(f-buf, f-buf_max_size); + +return f-buf; +} + +void qemu_clear_buffer(QEMUFile *f) +{ +f-buf_size = f-buf_index = f-buf_offset = 0; +} + static void qemu_fill_buffer(QEMUFile *f) { int len; @@ -477,6 +494,7 @@ int qemu_fclose(QEMUFile *f) qemu_fflush(f); if (f-close) ret = f-close(f-opaque); +qemu_free(f-buf); qemu_free(f); return ret; } -- 1.7.1.2
[Qemu-devel] [PATCH 18/18] Introduce kemari: to enable FT migration mode (Kemari).
When kemari: is set in front of URI of migrate command, it will turn on ft_mode to start FT migration mode (Kemari). On the receiver side, the option looks like, -incoming kemari:protocol:address:port Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Acked-by: Paolo Bonzini pbonz...@redhat.com --- hmp-commands.hx |4 +++- migration.c | 12 qmp-commands.hx |4 +++- 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/hmp-commands.hx b/hmp-commands.hx index 372bef4..4588f38 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -760,7 +760,9 @@ ETEXI \n\t\t\t -b for migration without shared storage with full copy of disk\n\t\t\t -i for migration without shared storage with incremental copy of disk - (base image shared between src and destination), + (base image shared between src and destination) + \n\t\t\t put \kemari:\ in front of URI to enable + Fault Tolerance mode (Kemari protocol), .user_print = monitor_user_noop, .mhandler.cmd_new = do_migrate, }, diff --git a/migration.c b/migration.c index 82f4a4d..95096ef 100644 --- a/migration.c +++ b/migration.c @@ -48,6 +48,12 @@ int qemu_start_incoming_migration(const char *uri) const char *p; int ret; +/* check ft_mode (Kemari protocol) */ +if (strstart(uri, kemari:, p)) { +ft_mode = FT_INIT; +uri = p; +} + if (strstart(uri, tcp:, p)) ret = tcp_start_incoming_migration(p); #if !defined(WIN32) @@ -99,6 +105,12 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) return -1; } +/* check ft_mode (Kemari protocol) */ +if (strstart(uri, kemari:, p)) { +ft_mode = FT_INIT; +uri = p; +} + if (strstart(uri, tcp:, p)) { s = tcp_start_outgoing_migration(mon, p, max_throttle, detach, blk, inc); diff --git a/qmp-commands.hx b/qmp-commands.hx index df40a3d..68ca48a 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -437,7 +437,9 @@ EQMP \n\t\t\t -b for migration without shared storage with full copy of disk\n\t\t\t -i for migration without shared storage with incremental copy of disk - (base image shared between src and destination), + (base image shared between src and destination) + \n\t\t\t put \kemari:\ in front of URI to enable + Fault Tolerance mode (Kemari protocol), .user_print = monitor_user_noop, .mhandler.cmd_new = do_migrate, }, -- 1.7.1.2
[Qemu-devel] [PATCH 06/18] virtio: decrement last_avail_idx with inuse before saving.
For regular migration inuse == 0 always as requests are flushed before save. However, event-tap log when enabled introduces an extra queue for requests which is not being flushed, thus the last inuse requests are left in the event-tap queue. Move the last_avail_idx value sent to the remote back to make it repeat the last inuse requests. Signed-off-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/virtio.c | 10 +- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/hw/virtio.c b/hw/virtio.c index 31bd9e3..f05d1b6 100644 --- a/hw/virtio.c +++ b/hw/virtio.c @@ -673,12 +673,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f) qemu_put_be32(f, i); for (i = 0; i VIRTIO_PCI_QUEUE_MAX; i++) { +/* For regular migration inuse == 0 always as + * requests are flushed before save. However, + * event-tap log when enabled introduces an extra + * queue for requests which is not being flushed, + * thus the last inuse requests are left in the event-tap queue. + * Move the last_avail_idx value sent to the remote back + * to make it repeat the last inuse requests. */ +uint16_t last_avail = vdev-vq[i].last_avail_idx - vdev-vq[i].inuse; if (vdev-vq[i].vring.num == 0) break; qemu_put_be32(f, vdev-vq[i].vring.num); qemu_put_be64(f, vdev-vq[i].pa); -qemu_put_be16s(f, vdev-vq[i].last_avail_idx); +qemu_put_be16s(f, last_avail); if (vdev-binding-save_queue) vdev-binding-save_queue(vdev-binding_opaque, i, f); } -- 1.7.1.2
[Qemu-devel] [PATCH 10/18] Call init handler of event-tap at main() in vl.c.
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- vl.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/vl.c b/vl.c index 4e263c3..614ac9c 100644 --- a/vl.c +++ b/vl.c @@ -162,6 +162,7 @@ int main(int argc, char **argv) #include qemu-queue.h #include cpus.h #include arch_init.h +#include event-tap.h #include ui/qemu-spice.h @@ -2931,6 +2932,8 @@ int main(int argc, char **argv, char **envp) blk_mig_init(); +event_tap_init(); + /* open the virtual block devices */ if (snapshot) qemu_opts_foreach(qemu_find_opts(drive), drive_enable_snapshot, NULL, 0); -- 1.7.1.2
[Qemu-devel] [PATCH 13/18] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().
event-tap function is called only when it is on. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- net.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/net.c b/net.c index ec4745d..724b549 100644 --- a/net.c +++ b/net.c @@ -36,6 +36,7 @@ #include qemu-common.h #include qemu_socket.h #include hw/qdev.h +#include event-tap.h static QTAILQ_HEAD(, VLANState) vlans; static QTAILQ_HEAD(, VLANClientState) non_vlan_clients; @@ -559,6 +560,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender, void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size) { +if (event_tap_is_on()) { +return event_tap_send_packet(vc, buf, size); +} + qemu_send_packet_async(vc, buf, size, NULL); } @@ -657,6 +662,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender, { NetQueue *queue; +if (event_tap_is_on()) { +return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb); +} + if (sender-link_down || (!sender-peer !sender-vlan)) { return calc_iov_length(iov, iovcnt); } -- 1.7.1.2
[Qemu-devel] [PATCH 09/18] Introduce event-tap.
event-tap controls when to start FT transaction, and provides proxy functions to called from net/block devices. While FT transaction, it queues up net/block requests, and flush them when the transaction gets completed. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.target |1 + event-tap.c | 940 +++ event-tap.h | 44 +++ qemu-tool.c | 28 ++ trace-events| 10 + 5 files changed, 1023 insertions(+), 0 deletions(-) create mode 100644 event-tap.c create mode 100644 event-tap.h diff --git a/Makefile.target b/Makefile.target index 220589e..da57efe 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,6 +199,7 @@ obj-y += rwhandler.o obj-$(CONFIG_KVM) += kvm.o kvm-all.o obj-$(CONFIG_NO_KVM) += kvm-stub.o LIBS+=-lz +obj-y += event-tap.o QEMU_CFLAGS += $(VNC_TLS_CFLAGS) QEMU_CFLAGS += $(VNC_SASL_CFLAGS) diff --git a/event-tap.c b/event-tap.c new file mode 100644 index 000..95c147a --- /dev/null +++ b/event-tap.c @@ -0,0 +1,940 @@ +/* + * Event Tap functions for QEMU + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include qemu-common.h +#include qemu-error.h +#include block.h +#include block_int.h +#include ioport.h +#include osdep.h +#include sysemu.h +#include hw/hw.h +#include net.h +#include event-tap.h +#include trace.h + +enum EVENT_TAP_STATE { +EVENT_TAP_OFF, +EVENT_TAP_ON, +EVENT_TAP_SUSPEND, +EVENT_TAP_FLUSH, +EVENT_TAP_LOAD, +EVENT_TAP_REPLAY, +}; + +static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF; + +typedef struct EventTapIOport { +uint32_t address; +uint32_t data; +int index; +} EventTapIOport; + +#define MMIO_BUF_SIZE 8 + +typedef struct EventTapMMIO { +uint64_t address; +uint8_t buf[MMIO_BUF_SIZE]; +int len; +} EventTapMMIO; + +typedef struct EventTapNetReq { +char *device_name; +int iovcnt; +int vlan_id; +bool vlan_needed; +bool async; +struct iovec *iov; +NetPacketSent *sent_cb; +} EventTapNetReq; + +#define MAX_BLOCK_REQUEST 32 + +typedef struct EventTapAIOCB EventTapAIOCB; + +typedef struct EventTapBlkReq { +char *device_name; +int num_reqs; +int num_cbs; +bool is_flush; +BlockRequest reqs[MAX_BLOCK_REQUEST]; +EventTapAIOCB *acb[MAX_BLOCK_REQUEST]; +} EventTapBlkReq; + +#define EVENT_TAP_IOPORT (1 0) +#define EVENT_TAP_MMIO (1 1) +#define EVENT_TAP_NET(1 2) +#define EVENT_TAP_BLK(1 3) + +#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1) + +typedef struct EventTapLog { +int mode; +union { +EventTapIOport ioport; +EventTapMMIO mmio; +}; +union { +EventTapNetReq net_req; +EventTapBlkReq blk_req; +}; +QTAILQ_ENTRY(EventTapLog) node; +} EventTapLog; + +struct EventTapAIOCB { +BlockDriverAIOCB common; +BlockDriverAIOCB *acb; +bool is_canceled; +}; + +static EventTapLog *last_event_tap; + +static QTAILQ_HEAD(, EventTapLog) event_list; +static QTAILQ_HEAD(, EventTapLog) event_pool; + +static int (*event_tap_cb)(void); +static QEMUBH *event_tap_bh; +static VMChangeStateEntry *vmstate; + +static void event_tap_bh_cb(void *p) +{ +if (event_tap_cb) { +event_tap_cb(); +} + +qemu_bh_delete(event_tap_bh); +event_tap_bh = NULL; +} + +static void event_tap_schedule_bh(void) +{ +trace_event_tap_ignore_bh(!!event_tap_bh); + +/* if bh is already set, we ignore it for now */ +if (event_tap_bh) { +return; +} + +event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL); +qemu_bh_schedule(event_tap_bh); + +return; +} + +static void *event_tap_alloc_log(void) +{ +EventTapLog *log; + +if (QTAILQ_EMPTY(event_pool)) { +log = qemu_mallocz(sizeof(EventTapLog)); +} else { +log = QTAILQ_FIRST(event_pool); +QTAILQ_REMOVE(event_pool, log, node); +} + +return log; +} + +static void event_tap_free_net_req(EventTapNetReq *net_req); +static void event_tap_free_blk_req(EventTapBlkReq *blk_req); + +static void event_tap_free_log(EventTapLog *log) +{ +int mode = log-mode ~EVENT_TAP_TYPE_MASK; + +if (mode == EVENT_TAP_NET) { +event_tap_free_net_req(log-net_req); +} else if (mode == EVENT_TAP_BLK) { +event_tap_free_blk_req(log-blk_req); +} + +log-mode = 0; + +/* return the log to event_pool */ +QTAILQ_INSERT_HEAD(event_pool, log, node); +} + +static void event_tap_free_pool(void) +{ +EventTapLog *log, *next; + +QTAILQ_FOREACH_SAFE(log, event_pool, node, next) { +QTAILQ_REMOVE(event_pool, log, node); +qemu_free(log); +} +} + +static void event_tap_free_net_req(EventTapNetReq *net_req) +{ +int i; + +if (!net_req-async) { +for
[Qemu-devel] [PATCH 00/18] Kemari for KVM v0.2.11
Hi, This patch series is a revised version of Kemari for KVM, which applied comments for the previous post. The current code is based on qemu.git 9a31334f419c1d773cf4b4bfbbdace96fbf8a4f4. The changes from v0.2.10 - v0.2.11 are: - rebased to 0.14 - upon unregistering event-tap, set event_tap_state after event_tap_flush - modify commit log of 02/18 that it won't make existing migration bi-directional. The changes from v0.2.9 - v0.2.10 are: - change migrate format to kemari:protocol:host:port (Paolo) The changes from v0.2.8 - v0.2.9 are: - abstract common code between qemu_savevm_{state,trans}_* (Paolo) - change incoming format to kemari:protocol:host:port (Paolo) The changes from v0.2.7 - v0.2.8 are: - fixed calling wrong cb in event-tap - add missing qemu_aio_release in event-tap The changes from v0.2.6 - v0.2.7 are: - add AIOCB, AIOPool and cancel functions (Kevin) - insert event-tap for bdrv_flush (Kevin) - add error handing when calling bdrv functions (Kevin) - fix usage of qemu_aio_flush and bdrv_flush (Kevin) - use bs in AIOCB on the primary (Kevin) - reorder event-tap functions to gather with block/net (Kevin) - fix checking bs-device_name (Kevin) The changes from v0.2.5 - v0.2.6 are: - use qemu_{put,get}_be32() to save/load niov in event-tap The changes from v0.2.4 - v0.2.5 are: - fixed braces and trailing spaces by using Blue's checkpatch.pl (Blue) - event-tap: don't try to send blk_req if it's a bdrv_aio_flush event The changes from v0.2.3 - v0.2.4 are: - call vm_start() before event_tap_flush_one() to avoid failure in virtio-net assertion - add vm_change_state_handler to turn off ft_mode - use qemu_iovec functions in event-tap - remove duplicated code in migration - remove unnecessary new line for error_report in ft_trans_file The changes from v0.2.2 - v0.2.3 are: - queue async net requests without copying (MST) -- if not async, contents of the packets are sent to the secondary - better description for option -k (MST) - fix memory transfer failure - fix ft transaction initiation failure The changes from v0.2.1 - v0.2.2 are: - decrement last_avaid_idx with inuse before saving (MST) - remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit() The changes from v0.2 - v0.2.1 are: - Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin) - Tap bdrv_aio_flush (Marcelo) - Remove multiwrite interface in event-tap (Stefan) - Fix event-tap to use pio/mmio to replay both net/block (Stefan) - Improve error handling in event-tap (Stefan) - Fix leak in event-tap (Stefan) - Revise virtio last_avail_idx manipulation (MST) - Clean up migration.c hook (Marcelo) - Make deleting change state handler robust (Isaku, Anthony) The changes from v0.1.1 - v0.2 are: - Introduce a queue in event-tap to make VM sync live. - Change transaction receiver to a state machine for async receiving. - Replace net/block layer functions with event-tap proxy functions. - Remove dirty bitmap optimization for now. - convert DPRINTF() in ft_trans_file to trace functions. - convert fprintf() in ft_trans_file to error_report(). - improved error handling in ft_trans_file. - add a tmp pointer to qemu_del_vm_change_state_handler. The changes from v0.1 - v0.1.1 are: - events are tapped in net/block layer instead of device emulation layer. - Introduce a new option for -incoming to accept FT transaction. - Removed writev() support to QEMUFile and FdMigrationState for now. I would post this work in a different series. - Modified virtio-blk save/load handler to send inuse variable to correctly replay. - Removed configure --enable-ft-mode. - Removed unnecessary check for qemu_realloc(). The first 6 patches modify several functions of qemu to prepare introducing Kemari specific components. The next 6 patches are the components of Kemari. They introduce event-tap and the FT transaction protocol file based on buffered file. The design document of FT transaction protocol can be found at, http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf Then the following 2 patches modifies net/block layer functions with event-tap functions. Please note that if Kemari is off, event-tap will just passthrough, and there is most no intrusion to exisiting functions including normal live migration. Finally, the migration layer are modified to support Kemari in the last 4 patches. Again, there shouldn't be any affection if a user doesn't specify Kemari specific options. The transaction is now async on both sender and receiver side. The sender side respects the max_downtime to decide when to switch from async to sync mode. The repository contains all patches I'm sending with this message. For those who want to try, please pull the following repository. It also includes dirty bitmap optimization which aren't ready for posting yet. To remove the dirty bitmap optimization, please look at HEAD~4 of the tree. git://kemari.git.sourceforge.net/gitroot/kemari/kemari next Thanks, Yoshi
[Qemu-devel] [PATCH 12/18] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.
Record mmio write event to replay it upon failover. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- exec.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/exec.c b/exec.c index d611100..e192eec 100644 --- a/exec.c +++ b/exec.c @@ -33,6 +33,7 @@ #include osdep.h #include kvm.h #include qemu-timer.h +#include event-tap.h #if defined(CONFIG_USER_ONLY) #include qemu.h #include signal.h @@ -3662,6 +3663,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf, io_index = (pd IO_MEM_SHIFT) (IO_MEM_NB_ENTRIES - 1); if (p) addr1 = (addr ~TARGET_PAGE_MASK) + p-region_offset; + +event_tap_mmio(addr, buf, len); + /* XXX: could force cpu_single_env to NULL to avoid potential bugs */ if (l = 4 ((addr1 3) == 0)) { -- 1.7.1.2
[Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
Currently FdMigrationState doesn't support read(), and this patch introduces it to get response from the other side. Note that this won't change the existing migration protocol to be bi-directional. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration-tcp.c | 15 +++ migration.c | 13 + migration.h |3 +++ 3 files changed, 31 insertions(+), 0 deletions(-) diff --git a/migration-tcp.c b/migration-tcp.c index b55f419..55777c8 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size) return send(s-fd, buf, size, 0); } +static int socket_read(FdMigrationState *s, const void * buf, size_t size) +{ +ssize_t len; + +do { +len = recv(s-fd, (void *)buf, size, 0); +} while (len == -1 socket_error() == EINTR); +if (len == -1) { +len = -socket_error(); +} + +return len; +} + static int tcp_close(FdMigrationState *s) { DPRINTF(tcp_close\n); @@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, s-get_error = socket_errno; s-write = socket_write; +s-read = socket_read; s-close = tcp_close; s-mig_state.cancel = migrate_fd_cancel; s-mig_state.get_status = migrate_fd_get_status; diff --git a/migration.c b/migration.c index af3a1f2..302b8fe 100644 --- a/migration.c +++ b/migration.c @@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) return ret; } +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size) +{ +FdMigrationState *s = opaque; +int ret; + +ret = s-read(s, data, size); +if (ret == -1) { +ret = -(s-get_error(s)); +} + +return ret; +} + void migrate_fd_connect(FdMigrationState *s) { int ret; diff --git a/migration.h b/migration.h index 2170792..88a6987 100644 --- a/migration.h +++ b/migration.h @@ -48,6 +48,7 @@ struct FdMigrationState int (*get_error)(struct FdMigrationState*); int (*close)(struct FdMigrationState*); int (*write)(struct FdMigrationState*, const void *, size_t); +int (*read)(struct FdMigrationState *, const void *, size_t); void *opaque; }; @@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque); ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size); +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size); + void migrate_fd_connect(FdMigrationState *s); void migrate_fd_put_ready(void *opaque); -- 1.7.1.2
[Qemu-devel] [PATCH 11/18] ioport: insert event_tap_ioport() to ioport_write().
Record ioport event to replay it upon failover. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- ioport.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/ioport.c b/ioport.c index aa4188a..74aebf5 100644 --- a/ioport.c +++ b/ioport.c @@ -27,6 +27,7 @@ #include ioport.h #include trace.h +#include event-tap.h /***/ /* IO Port */ @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data) default_ioport_writel }; IOPortWriteFunc *func = ioport_write_table[index][address]; +event_tap_ioport(index, address, data); if (!func) func = default_func[index]; func(ioport_opaque[address], address, data); -- 1.7.1.2
[Qemu-devel] [PATCH 15/18] savevm: introduce qemu_savevm_trans_{begin, commit}.
Introduce qemu_savevm_trans_{begin,commit} to send the memory and device info together, while avoiding cancelling memory state tracking. This patch also abstracts common code between qemu_savevm_state_{begin,iterate,commit}. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- savevm.c | 157 +++--- sysemu.h |2 + 2 files changed, 101 insertions(+), 58 deletions(-) diff --git a/savevm.c b/savevm.c index 78c1972..c96a393 100644 --- a/savevm.c +++ b/savevm.c @@ -1601,29 +1601,68 @@ bool qemu_savevm_state_blocked(Monitor *mon) return false; } -int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, -int shared) +/* + * section: header to write + * inc: if true, forces to pass SECTION_PART instead of SECTION_START + * pause: if true, breaks the loop when live handler returned 0 + */ +static int qemu_savevm_state_live(Monitor *mon, QEMUFile *f, int section, + bool inc, bool pause) { SaveStateEntry *se; +int skip = 0, ret; QTAILQ_FOREACH(se, savevm_handlers, entry) { -if(se-set_params == NULL) { +int len, stage; + +if (se-save_live_state == NULL) { continue; - } - se-set_params(blk_enable, shared, se-opaque); +} + +/* Section type */ +qemu_put_byte(f, section); +qemu_put_be32(f, se-section_id); + +if (section == QEMU_VM_SECTION_START) { +/* ID string */ +len = strlen(se-idstr); +qemu_put_byte(f, len); +qemu_put_buffer(f, (uint8_t *)se-idstr, len); + +qemu_put_be32(f, se-instance_id); +qemu_put_be32(f, se-version_id); + +stage = inc ? QEMU_VM_SECTION_PART : QEMU_VM_SECTION_START; +} else { +assert(inc); +stage = section; +} + +ret = se-save_live_state(mon, f, stage, se-opaque); +if (!ret) { +skip++; +if (pause) { +break; +} +} } - -qemu_put_be32(f, QEMU_VM_FILE_MAGIC); -qemu_put_be32(f, QEMU_VM_FILE_VERSION); + +return skip; +} + +static void qemu_savevm_state_full(QEMUFile *f) +{ +SaveStateEntry *se; QTAILQ_FOREACH(se, savevm_handlers, entry) { int len; -if (se-save_live_state == NULL) +if (se-save_state == NULL se-vmsd == NULL) { continue; +} /* Section type */ -qemu_put_byte(f, QEMU_VM_SECTION_START); +qemu_put_byte(f, QEMU_VM_SECTION_FULL); qemu_put_be32(f, se-section_id); /* ID string */ @@ -1634,9 +1673,29 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, qemu_put_be32(f, se-instance_id); qemu_put_be32(f, se-version_id); -se-save_live_state(mon, f, QEMU_VM_SECTION_START, se-opaque); +vmstate_save(f, se); +} + +qemu_put_byte(f, QEMU_VM_EOF); +} + +int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, +int shared) +{ +SaveStateEntry *se; + +QTAILQ_FOREACH(se, savevm_handlers, entry) { +if (se-set_params == NULL) { +continue; +} +se-set_params(blk_enable, shared, se-opaque); } +qemu_put_be32(f, QEMU_VM_FILE_MAGIC); +qemu_put_be32(f, QEMU_VM_FILE_VERSION); + +qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_START, 0, 0); + if (qemu_file_has_error(f)) { qemu_savevm_state_cancel(mon, f); return -EIO; @@ -1647,29 +1706,16 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f) { -SaveStateEntry *se; int ret = 1; -QTAILQ_FOREACH(se, savevm_handlers, entry) { -if (se-save_live_state == NULL) -continue; - -/* Section type */ -qemu_put_byte(f, QEMU_VM_SECTION_PART); -qemu_put_be32(f, se-section_id); - -ret = se-save_live_state(mon, f, QEMU_VM_SECTION_PART, se-opaque); -if (!ret) { -/* Do not proceed to the next vmstate before this one reported - completion of the current stage. This serializes the migration - and reduces the probability that a faster changing state is - synchronized over and over again. */ -break; -} -} - -if (ret) +/* Do not proceed to the next vmstate before this one reported + completion of the current stage. This serializes the migration + and reduces the probability that a faster changing state is + synchronized over and over again. */ +ret = qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_PART, 1, 1); +if (!ret) { return 1; +} if (qemu_file_has_error(f)) { qemu_savevm_state_cancel(mon, f); @@ -1681,46 +1727,41 @@ int
[Qemu-devel] [PATCH 05/18] vl.c: add deleted flag for deleting the handler.
Make deleting handlers robust against deletion of any elements in a handler by using a deleted flag like in file descriptors. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- vl.c | 13 + 1 files changed, 9 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index b436952..4e263c3 100644 --- a/vl.c +++ b/vl.c @@ -1158,6 +1158,7 @@ static void nographic_update(void *opaque) struct vm_change_state_entry { VMChangeStateHandler *cb; void *opaque; +int deleted; QLIST_ENTRY (vm_change_state_entry) entries; }; @@ -1178,8 +1179,7 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb, void qemu_del_vm_change_state_handler(VMChangeStateEntry *e) { -QLIST_REMOVE (e, entries); -qemu_free (e); +e-deleted = 1; } void vm_state_notify(int running, int reason) @@ -1188,8 +1188,13 @@ void vm_state_notify(int running, int reason) trace_vm_state_notify(running, reason); -for (e = vm_change_state_head.lh_first; e; e = e-entries.le_next) { -e-cb(e-opaque, running, reason); +QLIST_FOREACH(e, vm_change_state_head, entries) { +if (e-deleted) { +QLIST_REMOVE(e, entries); +qemu_free(e); +} else { +e-cb(e-opaque, running, reason); +} } } -- 1.7.1.2
[Qemu-devel] [PATCH 17/18] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
When ft_mode is set in the header, tcp_accept_incoming_migration() sets ft_trans_incoming() as a callback, and call qemu_file_get_notify() to receive FT transaction iteratively. We also need a hack no to close fd before moving to ft_transaction mode, so that we can reuse the fd for it. vm_change_state_handler is added to turn off ft_mode when cont is pressed. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration-tcp.c | 67 ++- 1 files changed, 66 insertions(+), 1 deletions(-) diff --git a/migration-tcp.c b/migration-tcp.c index 55777c8..84076d6 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -18,6 +18,8 @@ #include sysemu.h #include buffered_file.h #include block.h +#include ft_trans_file.h +#include event-tap.h //#define DEBUG_MIGRATION_TCP @@ -29,6 +31,8 @@ do { } while (0) #endif +static VMChangeStateEntry *vmstate; + static int socket_errno(FdMigrationState *s) { return socket_error(); @@ -56,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, size_t size) static int tcp_close(FdMigrationState *s) { DPRINTF(tcp_close\n); -if (s-fd != -1) { +/* FIX ME: accessing ft_mode here isn't clean */ +if (s-fd != -1 ft_mode != FT_INIT) { close(s-fd); s-fd = -1; } @@ -150,6 +155,36 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, return s-mig_state; } +static void ft_trans_incoming(void *opaque) +{ +QEMUFile *f = opaque; + +qemu_file_get_notify(f); +if (qemu_file_has_error(f)) { +ft_mode = FT_ERROR; +qemu_fclose(f); +} +} + +static void ft_trans_reset(void *opaque, int running, int reason) +{ +QEMUFile *f = opaque; + +if (running) { +if (ft_mode != FT_ERROR) { +qemu_fclose(f); +} +ft_mode = FT_OFF; +qemu_del_vm_change_state_handler(vmstate); +} +} + +static void ft_trans_schedule_replay(QEMUFile *f) +{ +event_tap_schedule_replay(); +vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f); +} + static void tcp_accept_incoming_migration(void *opaque) { struct sockaddr_in addr; @@ -175,8 +210,38 @@ static void tcp_accept_incoming_migration(void *opaque) goto out; } +if (ft_mode == FT_INIT) { +autostart = 0; +} + process_incoming_migration(f); + +if (ft_mode == FT_INIT) { +int ret; + +socket_set_nodelay(c); + +f = qemu_fopen_ft_trans(s, c); +if (f == NULL) { +fprintf(stderr, could not qemu_fopen_ft_trans\n); +goto out; +} + +/* need to wait sender to setup */ +ret = qemu_ft_trans_begin(f); +if (ret 0) { +goto out; +} + +qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f); +ft_trans_schedule_replay(f); +ft_mode = FT_TRANSACTION_RECV; + +return; +} + qemu_fclose(f); + out: close(c); out2: -- 1.7.1.2
[Qemu-devel] [PATCH 14/18] block: insert event-tap to bdrv_aio_writev(), bdrv_aio_flush() and bdrv_flush().
event-tap function is called only when it is on, and requests were sent from device emulators. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Acked-by: Kevin Wolf kw...@redhat.com --- block.c | 15 +++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index f7d91a2..b19729a 100644 --- a/block.c +++ b/block.c @@ -28,6 +28,7 @@ #include block_int.h #include module.h #include qemu-objects.h +#include event-tap.h #ifdef CONFIG_BSD #include sys/types.h @@ -1585,6 +1586,10 @@ int bdrv_flush(BlockDriverState *bs) } if (bs-drv bs-drv-bdrv_flush) { +if (*bs-device_name event_tap_is_on()) { +event_tap_bdrv_flush(); +} + return bs-drv-bdrv_flush(bs); } @@ -2220,6 +2225,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num, if (bdrv_check_request(bs, sector_num, nb_sectors)) return NULL; +if (*bs-device_name event_tap_is_on()) { +return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors, + cb, opaque); +} + if (bs-dirty_bitmap) { blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb, opaque); @@ -2483,6 +2493,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs, if (!drv) return NULL; + +if (*bs-device_name event_tap_is_on()) { +return event_tap_bdrv_aio_flush(bs, cb, opaque); +} + return drv-bdrv_aio_flush(bs, cb, opaque); } -- 1.7.1.2
[Qemu-devel] [PATCH 08/18] savevm: introduce util functions to control ft_trans_file from savevm layer.
To utilize ft_trans_file function, savevm needs interfaces to be exported. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/hw.h |5 ++ savevm.c | 149 ++ 2 files changed, 154 insertions(+), 0 deletions(-) diff --git a/hw/hw.h b/hw/hw.h index a168a37..a9eff5a 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, QEMUFile *qemu_fopen(const char *filename, const char *mode); QEMUFile *qemu_fdopen(int fd, const char *mode); QEMUFile *qemu_fopen_socket(int fd); +QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd); QEMUFile *qemu_popen(FILE *popen_file, const char *mode); QEMUFile *qemu_popen_cmd(const char *command, const char *mode); int qemu_stdio_fd(QEMUFile *f); @@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size); void qemu_put_byte(QEMUFile *f, int v); void *qemu_realloc_buffer(QEMUFile *f, int size); void qemu_clear_buffer(QEMUFile *f); +int qemu_ft_trans_begin(QEMUFile *f); +int qemu_ft_trans_commit(QEMUFile *f); +int qemu_ft_trans_cancel(QEMUFile *f); static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v) { @@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f); * halted due to rate limiting or EAGAIN errors occur as it can be used to * resume output. */ void qemu_file_put_notify(QEMUFile *f); +void qemu_file_get_notify(void *opaque); static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv) { diff --git a/savevm.c b/savevm.c index 52d5be8..78c1972 100644 --- a/savevm.c +++ b/savevm.c @@ -82,6 +82,7 @@ #include migration.h #include qemu_socket.h #include qemu-queue.h +#include ft_trans_file.h #define SELF_ANNOUNCE_ROUNDS 5 @@ -189,6 +190,13 @@ typedef struct QEMUFileSocket QEMUFile *file; } QEMUFileSocket; +typedef struct QEMUFileSocketTrans +{ +int fd; +QEMUFileSocket *s; +VMChangeStateEntry *e; +} QEMUFileSocketTrans; + static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size) { QEMUFileSocket *s = opaque; @@ -204,6 +212,22 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size) return len; } +static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size) +{ +QEMUFileSocket *s = opaque; +ssize_t len; + +do { +len = send(s-fd, (void *)buf, size, 0); +} while (len == -1 socket_error() == EINTR); + +if (len == -1) { +len = -socket_error(); +} + +return len; +} + static int socket_close(void *opaque) { QEMUFileSocket *s = opaque; @@ -211,6 +235,70 @@ static int socket_close(void *opaque) return 0; } +static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; +ssize_t len; + +len = socket_get_buffer(s, buf, pos, size); + +return len; +} + +static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size) +{ +QEMUFileSocketTrans *t = opaque; + +return socket_put_buffer(t-s, buf, size); +} + + +static int socket_trans_get_ready(void *opaque) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; +QEMUFile *f = s-file; +int ret = 0; + +ret = qemu_loadvm_state(f, 1); +if (ret 0) { +fprintf(stderr, +socket_trans_get_ready: error while loading vmstate\n); +} + +return ret; +} + +static int socket_trans_close(void *opaque) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; + +qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL); +qemu_set_fd_handler2(t-fd, NULL, NULL, NULL, NULL); +qemu_del_vm_change_state_handler(t-e); +close(s-fd); +close(t-fd); +qemu_free(s); +qemu_free(t); + +return 0; +} + +static void socket_trans_resume(void *opaque, int running, int reason) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; + +if (!running) { +return; +} + +qemu_announce_self(); +qemu_fclose(s-file); +} + static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size) { QEMUFileStdio *s = opaque; @@ -333,6 +421,26 @@ QEMUFile *qemu_fopen_socket(int fd) return s-file; } +QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd) +{ +QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans)); +QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket)); + +t-s = s; +t-fd = s_fd; +t-e = qemu_add_vm_change_state_handler(socket_trans_resume, t); + +s-fd = c_fd; +s-file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer, + socket_trans_get_buffer, NULL, + socket_trans_get_ready, + migrate_fd_wait_for_unfreeze, + socket_trans_close, 0);
[Qemu-devel] [PATCH] docs: Update stderr and simple backend, add systemtap backend
The following additions to the tracing documentation are included: 1. Move stderr backend documentation to top-level and out of simple backend. Include hints on when this backend is useful. 2. Document the simple backend thread-safety limitation. 3. Document the dtrace backend for SystemTap. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- docs/tracing.txt | 30 ++ 1 files changed, 26 insertions(+), 4 deletions(-) diff --git a/docs/tracing.txt b/docs/tracing.txt index 21183f9..a6cc56f 100644 --- a/docs/tracing.txt +++ b/docs/tracing.txt @@ -126,6 +126,14 @@ The nop backend generates empty trace event functions so that the compiler can optimize out trace events completely. This is the default and imposes no performance penalty. +=== Stderr === + +The stderr backend sends trace events directly to standard error. This +effectively turns trace events into debug printfs. + +This is the simplest backend and can be used together with existing code that +uses DPRINTF(). + === Simpletrace === The simple backend supports common use cases and comes as part of the QEMU @@ -133,10 +141,10 @@ source tree. It may not be as powerful as platform-specific or third-party trace backends but it is portable. This is the recommended trace backend unless you have specific needs for more advanced backends. -=== Stderr === - -The stderr backend sends trace events directly to standard error output -during emulation. +Warning: the simple backend is not thread-safe so only enable trace events +that are executed while the global mutex is held. Much of QEMU meets this +requirement but some utility functions like qemu_malloc() or thread-related +code cannot be safely traced using the simple backend. Monitor commands @@ -187,3 +195,17 @@ consistent. The ust backend uses the LTTng Userspace Tracer library. There are no monitor commands built into QEMU, instead UST utilities should be used to list, enable/disable, and dump traces. + +=== SystemTap === + +The dtrace backend uses DTrace sdt probes but has only been tested with +SystemTap. When SystemTap support is detected a .stp file with wrapper probes +is generated to make use in scripts more convenient. This step can also be +performed manually after a build in order to change the binary name in the .stp +probes: + +scripts/tracetool --dtrace --stap \ + --binary path/to/qemu-binary \ + --target-type system \ + --target-arch x86_64 \ + trace-events qemu.stp -- 1.7.2.3
[Qemu-devel] [PATCH 07/18] Introduce fault tolerant VM transaction QEMUFile and ft_mode.
This code implements VM transaction protocol. Like buffered_file, it sits between savevm and migration layer. With this architecture, VM transaction protocol is implemented mostly independent from other existing code. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.objs |1 + ft_trans_file.c | 624 +++ ft_trans_file.h | 72 +++ migration.c |3 + trace-events| 15 ++ 5 files changed, 715 insertions(+), 0 deletions(-) create mode 100644 ft_trans_file.c create mode 100644 ft_trans_file.h diff --git a/Makefile.objs b/Makefile.objs index c144df1..8856160 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -100,6 +100,7 @@ common-obj-y += msmouse.o ps2.o common-obj-y += qdev.o qdev-properties.o common-obj-y += block-migration.o common-obj-y += pflib.o +common-obj-y += ft_trans_file.o common-obj-$(CONFIG_BRLAPI) += baum.o common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o diff --git a/ft_trans_file.c b/ft_trans_file.c new file mode 100644 index 000..2b42b95 --- /dev/null +++ b/ft_trans_file.c @@ -0,0 +1,624 @@ +/* + * Fault tolerant VM transaction QEMUFile + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * This source code is based on buffered_file.c. + * Copyright IBM, Corp. 2008 + * Authors: + * Anthony Liguorialigu...@us.ibm.com + */ + +#include qemu-common.h +#include qemu-error.h +#include hw/hw.h +#include qemu-timer.h +#include sysemu.h +#include qemu-char.h +#include trace.h +#include ft_trans_file.h + +typedef struct FtTransHdr +{ +uint16_t cmd; +uint16_t id; +uint32_t seq; +uint32_t payload_len; +} FtTransHdr; + +typedef struct QEMUFileFtTrans +{ +FtTransPutBufferFunc *put_buffer; +FtTransGetBufferFunc *get_buffer; +FtTransPutReadyFunc *put_ready; +FtTransGetReadyFunc *get_ready; +FtTransWaitForUnfreezeFunc *wait_for_unfreeze; +FtTransCloseFunc *close; +void *opaque; +QEMUFile *file; + +enum QEMU_VM_TRANSACTION_STATE state; +uint32_t seq; +uint16_t id; + +int has_error; + +bool freeze_output; +bool freeze_input; +bool rate_limit; +bool is_sender; +bool is_payload; + +uint8_t *buf; +size_t buf_max_size; +size_t put_offset; +size_t get_offset; + +FtTransHdr header; +size_t header_offset; +} QEMUFileFtTrans; + +#define IO_BUF_SIZE 32768 + +static void ft_trans_append(QEMUFileFtTrans *s, +const uint8_t *buf, size_t size) +{ +if (size (s-buf_max_size - s-put_offset)) { +trace_ft_trans_realloc(s-buf_max_size, size + 1024); +s-buf_max_size += size + 1024; +s-buf = qemu_realloc(s-buf, s-buf_max_size); +} + +trace_ft_trans_append(size); +memcpy(s-buf + s-put_offset, buf, size); +s-put_offset += size; +} + +static void ft_trans_flush(QEMUFileFtTrans *s) +{ +size_t offset = 0; + +if (s-has_error) { +error_report(flush when error %d, bailing, s-has_error); +return; +} + +while (offset s-put_offset) { +ssize_t ret; + +ret = s-put_buffer(s-opaque, s-buf + offset, s-put_offset - offset); +if (ret == -EAGAIN) { +break; +} + +if (ret = 0) { +error_report(error flushing data, %s, strerror(errno)); +s-has_error = FT_TRANS_ERR_FLUSH; +break; +} else { +offset += ret; +} +} + +trace_ft_trans_flush(offset, s-put_offset); +memmove(s-buf, s-buf + offset, s-put_offset - offset); +s-put_offset -= offset; +s-freeze_output = !!s-put_offset; +} + +static ssize_t ft_trans_put(void *opaque, void *buf, int size) +{ +QEMUFileFtTrans *s = opaque; +size_t offset = 0; +ssize_t len; + +/* flush buffered data before putting next */ +if (s-put_offset) { +ft_trans_flush(s); +} + +while (!s-freeze_output offset size) { +len = s-put_buffer(s-opaque, (uint8_t *)buf + offset, size - offset); + +if (len == -EAGAIN) { +trace_ft_trans_freeze_output(); +s-freeze_output = 1; +break; +} + +if (len = 0) { +error_report(putting data failed, %s, strerror(errno)); +s-has_error = 1; +offset = -EINVAL; +break; +} + +offset += len; +} + +if (s-freeze_output) { +ft_trans_append(s, buf + offset, size - offset); +offset = size; +} + +return offset; +} + +static int ft_trans_send_header(QEMUFileFtTrans *s, +enum QEMU_VM_TRANSACTION_STATE state, +uint32_t payload_len) +{ +int ret; +FtTransHdr
Re: [Qemu-devel] [RFC PATCH] block: Fix eject -f for locked devices
Sorry for my slow reply. Blue Swirl blauwir...@gmail.com writes: On Fri, Feb 18, 2011 at 5:16 PM, Markus Armbruster arm...@redhat.com wrote: From 8cd4978c9be6ff2bcc414bb1c1b258b96b9a74c1 Mon Sep 17 00:00:00 2001 From: Markus Armbruster arm...@redhat.com Date: Fri, 18 Feb 2011 15:54:02 +0100 After forcefully ejecting media locked by the guest, you can't ever again insert new media. Example: (qemu) info block hda: type=hd removable=0 file=test.img ro=0 drv=raw encrypted=0 cd: type=cdrom removable=1 locked=1 file=x.iso ro=0 drv=raw encrypted=0 (qemu) eject cd Device 'cd' is locked (qemu) eject -f cd (qemu) info block hda: type=hd removable=0 file=test.img ro=0 drv=raw encrypted=0 cd: type=cdrom removable=1 locked=1 [not inserted] (qemu) change cd x.iso Device 'cd' is locked Signed-off-by: Markus Armbruster arm...@redhat.com --- I'm not entirely sure this is the appropriate fix, and that's why there's RFC in the subject. Both IDE and SCSI devices expose their drive's BlockDriverState member locked to the guest via mode sense. What does real hardware do when I force-eject media (typically by rummaging in that little hole with a paperclip)? Does it actively notify the OS? Does mode sense change? No idea, but IIRC the drive is still usable after that, That's what I'd expect. If the OS recovers from losing the media, the drive should be fine. so locking the drive does not look correct. I'm not sure I get you here. A possible alternative fix is to make do_change_block() ignore bdrv_is_locked() when inserting media into an empty drive. Then the meaning of locked would change, maybe eject_disabled would then describe the state better. I like the current meaning of BlockDriverState member locked just fine, it nicely mirrors the SCSI / ATAPI mode sense page bit. I'm worried about unwanted side effects of my change. For instance, what about this code in do_snapshot_blkdev(): flags = bs-open_flags; bdrv_close(bs); ret = bdrv_open(bs, filename, flags, drv); I'm afraid my change would make it screw up bs-locked. Jes?
[Qemu-devel] [PATCH 16/18] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.
Introduce migrate_ft_trans_put_ready() which kicks the FT transaction cycle. When ft_mode is on, migrate_fd_put_ready() would open ft_trans_file and turn on event_tap. To end or cancel FT transaction, ft_mode and event_tap is turned off. migrate_ft_trans_get_ready() is called to receive ack from the receiver. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration.c | 261 ++- 1 files changed, 260 insertions(+), 1 deletions(-) diff --git a/migration.c b/migration.c index 3be3554..82f4a4d 100644 --- a/migration.c +++ b/migration.c @@ -21,6 +21,7 @@ #include qemu_socket.h #include block-migration.h #include qemu-objects.h +#include event-tap.h //#define DEBUG_MIGRATION @@ -283,6 +284,14 @@ void migrate_fd_error(FdMigrationState *s) migrate_fd_cleanup(s); } +static void migrate_ft_trans_error(FdMigrationState *s) +{ +ft_mode = FT_ERROR; +qemu_savevm_state_cancel(s-mon, s-file); +migrate_fd_error(s); +event_tap_unregister(); +} + int migrate_fd_cleanup(FdMigrationState *s) { int ret = 0; @@ -318,6 +327,17 @@ void migrate_fd_put_notify(void *opaque) qemu_file_put_notify(s-file); } +static void migrate_fd_get_notify(void *opaque) +{ +FdMigrationState *s = opaque; + +qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL); +qemu_file_get_notify(s-file); +if (qemu_file_has_error(s-file)) { +migrate_ft_trans_error(s); +} +} + ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) { FdMigrationState *s = opaque; @@ -353,6 +373,10 @@ int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size) ret = -(s-get_error(s)); } +if (ret == -EAGAIN) { +qemu_set_fd_handler2(s-fd, NULL, migrate_fd_get_notify, NULL, s); +} + return ret; } @@ -379,6 +403,230 @@ void migrate_fd_connect(FdMigrationState *s) migrate_fd_put_ready(s); } +static int migrate_ft_trans_commit(void *opaque) +{ +FdMigrationState *s = opaque; +int ret = -1; + +if (ft_mode != FT_TRANSACTION_COMMIT ft_mode != FT_TRANSACTION_ATOMIC) { +fprintf(stderr, +migrate_ft_trans_commit: invalid ft_mode %d\n, ft_mode); +goto out; +} + +do { +if (ft_mode == FT_TRANSACTION_ATOMIC) { +if (qemu_ft_trans_begin(s-file) 0) { +fprintf(stderr, qemu_ft_trans_begin failed\n); +goto out; +} + +ret = qemu_savevm_trans_begin(s-mon, s-file, 0); +if (ret 0) { +fprintf(stderr, qemu_savevm_trans_begin failed\n); +goto out; +} + +ft_mode = FT_TRANSACTION_COMMIT; +if (ret) { +/* don't proceed until if fd isn't ready */ +goto out; +} +} + +/* make the VM state consistent by flushing outstanding events */ +vm_stop(0); + +/* send at full speed */ +qemu_file_set_rate_limit(s-file, 0); + +ret = qemu_savevm_trans_complete(s-mon, s-file); +if (ret 0) { +fprintf(stderr, qemu_savevm_trans_complete failed\n); +goto out; +} + +ret = qemu_ft_trans_commit(s-file); +if (ret 0) { +fprintf(stderr, qemu_ft_trans_commit failed\n); +goto out; +} + +if (ret) { +ft_mode = FT_TRANSACTION_RECV; +ret = 1; +goto out; +} + +/* flush and check if events are remaining */ +vm_start(); +ret = event_tap_flush_one(); +if (ret 0) { +fprintf(stderr, event_tap_flush_one failed\n); +goto out; +} + +ft_mode = ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC; +} while (ft_mode != FT_TRANSACTION_BEGIN); + +vm_start(); +ret = 0; + +out: +return ret; +} + +static int migrate_ft_trans_get_ready(void *opaque) +{ +FdMigrationState *s = opaque; +int ret = -1; + +if (ft_mode != FT_TRANSACTION_RECV) { +fprintf(stderr, +migrate_ft_trans_get_ready: invalid ft_mode %d\n, ft_mode); +goto error_out; +} + +/* flush and check if events are remaining */ +vm_start(); +ret = event_tap_flush_one(); +if (ret 0) { +fprintf(stderr, event_tap_flush_one failed\n); +goto error_out; +} + +if (ret) { +ft_mode = FT_TRANSACTION_BEGIN; +} else { +ft_mode = FT_TRANSACTION_ATOMIC; + +ret = migrate_ft_trans_commit(s); +if (ret 0) { +goto error_out; +} +if (ret) { +goto out; +} +} + +vm_start(); +ret = 0; +goto out; + +error_out: +migrate_ft_trans_error(s); + +out: +return ret; +} + +static int migrate_ft_trans_put_ready(void) +{ +FdMigrationState *s = migrate_to_fms(current_migration); +
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Regards, Anthony Liguori Compression perhaps not, but if you choose compression, then performance is not your top consideration. That's the case with filesystems that support compression as well.
[Qemu-devel] [PATCH 04/18] qemu-char: export socket_set_nodelay().
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- qemu-char.c |2 +- qemu_socket.h |1 + 2 files changed, 2 insertions(+), 1 deletions(-) diff --git a/qemu-char.c b/qemu-char.c index bd4e944..c4f1940 100644 --- a/qemu-char.c +++ b/qemu-char.c @@ -2111,7 +2111,7 @@ static void tcp_chr_telnet_init(int fd) send(fd, (char *)buf, 3, 0); } -static void socket_set_nodelay(int fd) +void socket_set_nodelay(int fd) { int val = 1; setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)val, sizeof(val)); diff --git a/qemu_socket.h b/qemu_socket.h index 897a8ae..b7f8465 100644 --- a/qemu_socket.h +++ b/qemu_socket.h @@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia); int qemu_socket(int domain, int type, int protocol); int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen); void socket_set_nonblock(int fd); +void socket_set_nodelay(int fd); int send_all(int fd, const void *buf, int len1); /* New, ipv6-ready socket helper functions, see qemu-sockets.c */ -- 1.7.1.2
[Qemu-devel] Re: virtio-serial semantics for binary data and guest agents
On 02/22/2011 10:59 PM, Amit Shah wrote: On (Tue) 22 Feb 2011 [16:40:55], Michael Roth wrote: If something in the guest is attempting to read/write from the virtio-serial device, and nothing is connected to virtio-serial's host character device (say, a socket) 1. writes will block until something connect()s, at which point the write will succeed 2. reads will always return 0 until something connect()s, at which point the reads will block until there's data This makes it difficult (impossible?) to implement the notion of connect/disconnect or open/close over virtio-serial without layering another protocol on top using hackish things like length-encoded payloads or sentinel values to determine the end of one RPC/request/response/session and the start of the next. For instance, if the host side disconnects, then reconnects before we read(), we may never get the read()=0, and our FD remains valid. Whereas with a tcp/unix socket our FD is no longer valid, and the read()=0 is an event we can check for at any point after the other end does a close/disconnect. There's SIGIO support, so host connect-disconnect notifications can be caught via the signal. I recall looking into this at some pointbut don't we get a SIGIO for read/write-ability in general? So you still need some way differentiate, say, readability from a disconnect/EOF, and the read()=0 that could determine this is still racing with host-side reconnects. Also, nonblocking reads/writes will return -EPIPE if the host-side connection is not up. But we still essentially need to poll() for a host-side disconnected state, which is still racy since they may reconnect before we've done a read/write that would've generated the -EPIPE. It seems like what we really need is for the FD to be invalid from that point forward. Also, I focused more on the guest-side connect/disconnect detection, but as Anthony mentioned I think the host side shares similar limitations as well. AFAIK once we connect to the chardev that FD remains valid until the connected process closes it, and so races with the guest side on detecting connect/disconnect events in a similar manner. For the host side it looks like virtio-console has guest_close/guest_open callbacks already that we could potentially use...seems like it's just a matter of tying them to the chardev... basically having virtio-serial's guest_close() result in a close() on the corresponding chardev connection's FD. Amit
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 03:13 AM, Kevin Wolf wrote: Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. I'm interested in both. An extra write is always going to be an extra write. The flush just makes it very painful. A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. No, as designed today, qcow2 still needs a dirty flag to avoid leaking blocks. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. No, I only label compression and internal snapshots as misfeatures. Encryption is a completely reasonable feature. So even with qcow3, what's the expectation of snapshots? Are we going to scale to images with over 1000 snapshots? I believe snapshot support in qcow2 is not a feature that has been designed with any serious thought. If we truly want to support internal snapshots, let's design it correctly. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? Performance. One thing we haven't tested with qcow2 is O_SYNC performance in the guest but my suspicion is that an O_SYNC workload is going to perform poorly even with cache=none. Starting with a simple format that we don't have to jump through tremendous hoops to get reasonable performance out of has a lot of virtues. Regards, Anthony Liguori
Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
On 02/23/2011 07:01 AM, Avi Kivity wrote: On 02/23/2011 01:14 AM, Anthony Liguori wrote: -drive already ties into the qemuopts infrastructure and we have readconfig and writeconfig. I don't think we're missing any major pieces to do this in a more proper fashion. The problem with qemu config files is that it splits the authoritative source of where images are stored into two. Is it in the management tool's database or is it in qemu's config file? I like to use the phrase stateful config file. To me, it's just a database for QEMU to persist data about the VM. It's the only way for QEMU to make certain transactions atomic in the face of QEMU crashing. The user visible config file is a totally different concept. A management tool launches QEMU and tells it where to keep it's state database. The management application may prepopulate the state database or it may just use an empty file. QEMU uses the state database to store information that is created dynamically. For instance, devices added through device_add. A device added via -device wouldn't necessary get added to the state database. Practically speaking, it let's you invoke QEMU with a fixed command line, while still using the monitor to make changes that would otherwise require the command line being updated. For the problem at hand, one solution is to make qemu stop after the copy, and then management can issue an additional command to rearrange the disk and resume the guest. A drawback here is that if management dies, the guest is stopped until it restarts. We also make management latency guest visible, even if it doesn't die at an inconvenient place. An alternative approach is to have the copy be performed by a new layered block format driver: - create a new image, type = live-copy, containing three pieces of information - source image - destination image - copy state (initially nothing is copied) - tell qemu switch to the new image - qemu starts copying, updates copy state as needed - copy finishes, event is emitted; reads and writes still serviced - management receives event, switches qemu to destination image - management removes live-copy image If management dies while this is happening, it can simply query the state of the copy. Similarly, if qemu dies, the copy state is persistent (could be 0/1 or real range of blocks). This is a more elegant solution to the problem than the commit problem but it's also a one-off. I think we have a generic problem here and we ought to try to solve it generically (within reason). Regards, Anthony Liguori
[Qemu-devel] Re: virtio-serial semantics for binary data and guest agents
On 02/23/2011 08:31 AM, Michael Roth wrote: On 02/22/2011 10:59 PM, Amit Shah wrote: On (Tue) 22 Feb 2011 [16:40:55], Michael Roth wrote: If something in the guest is attempting to read/write from the virtio-serial device, and nothing is connected to virtio-serial's host character device (say, a socket) 1. writes will block until something connect()s, at which point the write will succeed 2. reads will always return 0 until something connect()s, at which point the reads will block until there's data This makes it difficult (impossible?) to implement the notion of connect/disconnect or open/close over virtio-serial without layering another protocol on top using hackish things like length-encoded payloads or sentinel values to determine the end of one RPC/request/response/session and the start of the next. For instance, if the host side disconnects, then reconnects before we read(), we may never get the read()=0, and our FD remains valid. Whereas with a tcp/unix socket our FD is no longer valid, and the read()=0 is an event we can check for at any point after the other end does a close/disconnect. There's SIGIO support, so host connect-disconnect notifications can be caught via the signal. I recall looking into this at some pointbut don't we get a SIGIO for read/write-ability in general? So you still need some way differentiate, say, readability from a disconnect/EOF, and the read()=0 that could determine this is still racing with host-side reconnects. Also, nonblocking reads/writes will return -EPIPE if the host-side connection is not up. But we still essentially need to poll() for a host-side disconnected heh, just poll not poll() sorry state, which is still racy since they may reconnect before we've done a read/write that would've generated the -EPIPE. It seems like what we really need is for the FD to be invalid from that point forward. Also, I focused more on the guest-side connect/disconnect detection, but as Anthony mentioned I think the host side shares similar limitations as well. AFAIK once we connect to the chardev that FD remains valid until the connected process closes it, and so races with the guest side on detecting connect/disconnect events in a similar manner. For the host side it looks like virtio-console has guest_close/guest_open callbacks already that we could potentially use...seems like it's just a matter of tying them to the chardev... basically having virtio-serial's guest_close() result in a close() on the corresponding chardev connection's FD. Amit
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. What are the points where you think that performance of internal snapshots suffers? The argument that I would understand is that internal snapshots are probably not as handy in all scenarios. Kevin
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On Wednesday 23 February 2011, Gerhard Wiesinger wrote: After some further tests and looking at the iproute ip and kernel code I finally gave up because I thing such a setup it is not possible without breaking up/reconfiguring eth0. When I have to reconfigure eth0 I think a better approach is to configure a bridge which I finally did and works well. I tried to explain/document the macvtap/macvlan concepts and limitations below. Please comment on it whether this is true or false. macvtap/macvlan driver concepts and limitations: 1.) macvlan driver adds a MAC address to a lower interface device where the actual macvlanx device is based on 2.) macvtap driver is based on macvlan driver and macvtap driver adds additional functionality of interface = external program communication with stdin/stdout channel. 3.) Limitations: macvtap/macvlan based devices can only communicate with childs based on the same lower device (e.g. eth0 in this sample) but not to the lower device itself, only to the outside world of the interface Correct. finally this makes the macvlan/macvtap approach useless because main eth0 interface must still be broken in the chain and reconfigured which was against the requirements that eth0 should not be touched and reconfigured! Yes, that is unfortunate, but it's the same that you'd get with a bridge device: When you have a bridge on top of eth0, you can no longer assign an IP address to eth0 and let it communicate with the virtual ports on the bridge. You need to instead set the IP address on the bridge itself. Macvlan is slightly better because it allows you to have multiple host devices that can each have their own MAC/IP address, unlike the bridge, but of course it can not be connected to anything else besides macvlan or macvtap ports. Arnd
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 15:21, schrieb Anthony Liguori: On 02/23/2011 03:13 AM, Kevin Wolf wrote: Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. I'm interested in both. An extra write is always going to be an extra write. The flush just makes it very painful. One extra write of 64k every 2 GB. Hardly relevant. A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. No, as designed today, qcow2 still needs a dirty flag to avoid leaking blocks. I know that this is your opinion and I do respect that, this is one of the reasons why there is the suggestion to add the dirty flag for you. On the other hand, it would be about time for you to accept that there are people who think differently about it and who don't want the same as you. This is why using the dirty flag should be optional. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. No, I only label compression and internal snapshots as misfeatures. Encryption is a completely reasonable feature. I didn't even mention encryption. It's obvious that it's a reasonable feature and not a misfeature, because it fits relatively easily in your QED design. :-) The three features you don't like because they don't fit are compression, internal snapshots and not having to fsck (thanks for proving the latter above) So even with qcow3, what's the expectation of snapshots? Are we going to scale to images with over 1000 snapshots? I believe snapshot support in qcow2 is not a feature that has been designed with any serious thought. If we truly want to support internal snapshots, let's design it correctly. So what would be the key differences between your design and qcow2's? We can always check if there's room to improve. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? Performance. One thing we haven't tested with qcow2 is O_SYNC performance in the guest but my suspicion is that an O_SYNC workload is going to perform poorly even with cache=none. But wasn't it you who wants to use the dirty flag in any case? The refcounts aren't even written then. Starting with a simple format that we don't have to jump through tremendous hoops to get reasonable performance out of has a lot of virtues. I know that you don't mean it like I read this, but it's entirely true: You're _starting_ with a simple
Re: [Qemu-devel] [PATCH 1/7] usb-ccid: add CCID bus
On Tue, Feb 22, 2011 at 10:03:45AM -0600, Anthony Liguori wrote: On 02/07/2011 10:34 AM, Alon Levy wrote: +static int ccid_post_load(void *opaque, int version_id) +{ +USBCCIDState *s = opaque; + +// This must be done after usb_device_attach, which sets state to ATTACHED, +// while it must be DEFAULT in order to accept packets (like it is after +// reset, but reset will reset our addr and call our reset handler which +// may change state, and we don't want to do that when migrating). +s-dev.state = s-state_vmstate; +return 0; +} + +static void ccid_pre_save(void *opaque) +{ +USBCCIDState *s = opaque; + +s-state_vmstate = s-dev.state; +if (s-dev.attached) { +// migrating an open device, ignore reconnection CHR_EVENT to avoid an +// erronous detach. +s-migration_state = MIGRATION_MIGRATED; +} Still using C99 comments, should be C89. Fixed (by checkpatch.pl iteration) in v20. Regards, Anthony Liguori
[Qemu-devel] [PATCH 0/3] tcg: Support debugging leakage of temporaries
This patchset removes the ad-hoc debug code in target-arm for identifying cases where we leaked TCG temporary variables, in favour of an implementation in tcg itself. Generally any temporaries created by a target while it is translating an instruction should be freed by the end of that instruction; otherwise carefully crafted guest code could cause TCG to run out of temporaries and assert. Putting the leak-debugging code into TCG proper (a) makes more sense as this isn't at all arm-specific (b) makes it more comprehensive, as it now covers temporaries created in all ways, not just via the new_tmp()/dead_tmp() wrapper functions (c) avoids annoying false positives where eg a TCG temp created with tcg_const_i32() was passed to dead_tmp(). The tracking only happens if qemu was configured with --enable-debug-tcg. It should be easy to add to other targets if desired; it's just a matter of calling tcg_clear_temp_count() and tcg_check_temp_count() in the appropriate places. Peter Maydell (3): tcg: Add support for debugging leakage of temporaries target-arm: Remove ad-hoc leak checking code target-arm: Use TCG temporary leak debugging facilities target-arm/translate.c | 705 +++ tcg/tcg.c | 32 +++ tcg/tcg.h | 17 ++ 3 files changed, 394 insertions(+), 360 deletions(-)
[Qemu-devel] [PATCH 3/3] target-arm: Use TCG temporary leak debugging facilities
Use the new TCG temporary leak debugging facilities to check that each ARM instruction does not leak temporaries. Signed-off-by: Peter Maydell peter.mayd...@linaro.org --- target-arm/translate.c |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/target-arm/translate.c b/target-arm/translate.c index 31067d5..b96a136 100644 --- a/target-arm/translate.c +++ b/target-arm/translate.c @@ -9125,6 +9125,8 @@ static inline void gen_intermediate_code_internal(CPUState *env, gen_icount_start(); +tcg_clear_temp_count(); + /* A note on handling of the condexec (IT) bits: * * We want to avoid the overhead of having to write the updated condexec @@ -9234,6 +9236,11 @@ static inline void gen_intermediate_code_internal(CPUState *env, gen_set_label(dc-condlabel); dc-condjmp = 0; } + +if (tcg_check_temp_count()) { +fprintf(stderr, TCG temporary leak before %08x\n, dc-pc); +} + /* Translation stops when a conditional branch is encountered. * Otherwise the subsequent code could get translated several times. * Also stop translation when a page boundary is reached. This -- 1.7.1
[Qemu-devel] [PATCH 1/3] tcg: Add support for debugging leakage of temporaries
Add support (if CONFIG_DEBUG_TCG is defined) for debugging leakage of temporary variables. Generally any temporaries created by a target while it is translating an instruction should be freed by the end of that instruction; otherwise carefully crafted guest code could cause TCG to run out of temporaries and assert. By calling tcg_check_temp_count() after each instruction we can check that we are not leaking temporaries in this way. Signed-off-by: Peter Maydell peter.mayd...@linaro.org --- tcg/tcg.c | 32 tcg/tcg.h | 17 + 2 files changed, 49 insertions(+), 0 deletions(-) diff --git a/tcg/tcg.c b/tcg/tcg.c index 5dd6a2c..8748c05 100644 --- a/tcg/tcg.c +++ b/tcg/tcg.c @@ -450,6 +450,10 @@ static inline int tcg_temp_new_internal(TCGType type, int temp_local) s-nb_temps++; } } + +#if defined(CONFIG_DEBUG_TCG) +s-temps_in_use++; +#endif return idx; } @@ -475,6 +479,13 @@ static inline void tcg_temp_free_internal(int idx) TCGTemp *ts; int k; +#if defined(CONFIG_DEBUG_TCG) +s-temps_in_use--; +if (s-temps_in_use 0) { +fprintf(stderr, More temporaries freed than allocated!\n); +} +#endif + assert(idx = s-nb_globals idx s-nb_temps); ts = s-temps[idx]; assert(ts-temp_allocated != 0); @@ -528,6 +539,27 @@ TCGv_i64 tcg_const_local_i64(int64_t val) return t0; } +#if defined(CONFIG_DEBUG_TCG) +void tcg_clear_temp_count(void) +{ +TCGContext *s = tcg_ctx; +s-temps_in_use = 0; +} + +int tcg_check_temp_count(void) +{ +TCGContext *s = tcg_ctx; +if (s-temps_in_use) { +/* Clear the count so that we don't give another + * warning immediately next time around. + */ +s-temps_in_use = 0; +return 1; +} +return 0; +} +#endif + void tcg_register_helper(void *func, const char *name) { TCGContext *s = tcg_ctx; diff --git a/tcg/tcg.h b/tcg/tcg.h index e1afde2..5538db9 100644 --- a/tcg/tcg.h +++ b/tcg/tcg.h @@ -323,6 +323,10 @@ struct TCGContext { int64_t restore_count; int64_t restore_time; #endif + +#ifdef CONFIG_DEBUG_TCG +int temps_in_use; +#endif }; extern TCGContext tcg_ctx; @@ -392,6 +396,19 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void) void tcg_temp_free_i64(TCGv_i64 arg); char *tcg_get_arg_str_i64(TCGContext *s, char *buf, int buf_size, TCGv_i64 arg); +#if defined(CONFIG_DEBUG_TCG) +/* If you call tcg_clear_temp_count() at the start of a section of + * code which is not supposed to leak any TCG temporaries, then + * calling tcg_check_temp_count() at the end of the section will + * return 1 if the section did in fact leak a temporary. + */ +void tcg_clear_temp_count(void); +int tcg_check_temp_count(void); +#else +#define tcg_clear_temp_count() +#define tcg_check_temp_count() 0 +#endif + void tcg_dump_info(FILE *f, fprintf_function cpu_fprintf); #define TCG_CT_ALIAS 0x80 -- 1.7.1
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 08:38 AM, Kevin Wolf wrote: Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). I was really referring to snapshots. I have absolutely no doubt that encryption can be implemented with a reasonable performance overhead. existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount 1. I think you quickly end up reinventing btrfs though in the process. Regards, Anthony Liguori What are the points where you think that performance of internal snapshots suffers? The argument that I would understand is that internal snapshots are probably not as handy in all scenarios. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. IMHO dm-crypt isn't a generally usable alternative to native built in encryption in qcow2. It isn't usable at all by non-root. If you want to use with plain files, then you need to turn the file into a loopback device and then layer in dm-crypt. It is generally just a PITA to manage. Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
On 02/23/2011 04:35 PM, Anthony Liguori wrote: On 02/23/2011 07:01 AM, Avi Kivity wrote: On 02/23/2011 01:14 AM, Anthony Liguori wrote: -drive already ties into the qemuopts infrastructure and we have readconfig and writeconfig. I don't think we're missing any major pieces to do this in a more proper fashion. The problem with qemu config files is that it splits the authoritative source of where images are stored into two. Is it in the management tool's database or is it in qemu's config file? I like to use the phrase stateful config file. To me, it's just a database for QEMU to persist data about the VM. It's the only way for QEMU to make certain transactions atomic in the face of QEMU crashing. The user visible config file is a totally different concept. A management tool launches QEMU and tells it where to keep it's state database. The management application may prepopulate the state database or it may just use an empty file. In that case the word 'config' is misleading. To me, it implies that the user configures something, and qemu reads it, not something mostly internal to qemu. Qemu does keep state. Currently only images, but in theory also the on-board NVRAM. QEMU uses the state database to store information that is created dynamically. For instance, devices added through device_add. A device added via -device wouldn't necessary get added to the state database. Practically speaking, it let's you invoke QEMU with a fixed command line, while still using the monitor to make changes that would otherwise require the command line being updated. Then the invoker quickly loses track of what the actual state is. It can't just remember which commands it issued (presumably in response to the user updating user visible state). It has to parse the stateful config file qemu outputs. But at which points should it parse it? I don't think it's reasonable to have three different ways to interact with qemu, all needed: the command line, reading and writing the stateful config file, and the monitor. I'd rather push for starting qemu with a blank guest and assembling (cold-plugging) all the hardware via the monitor before starting the guest. For the problem at hand, one solution is to make qemu stop after the copy, and then management can issue an additional command to rearrange the disk and resume the guest. A drawback here is that if management dies, the guest is stopped until it restarts. We also make management latency guest visible, even if it doesn't die at an inconvenient place. An alternative approach is to have the copy be performed by a new layered block format driver: - create a new image, type = live-copy, containing three pieces of information - source image - destination image - copy state (initially nothing is copied) - tell qemu switch to the new image - qemu starts copying, updates copy state as needed - copy finishes, event is emitted; reads and writes still serviced - management receives event, switches qemu to destination image - management removes live-copy image If management dies while this is happening, it can simply query the state of the copy. Similarly, if qemu dies, the copy state is persistent (could be 0/1 or real range of blocks). This is a more elegant solution to the problem than the commit problem but it's also a one-off. I think we have a generic problem here and we ought to try to solve it generically (within reason). Can you give more examples? I think I demonstrated that hot-plug can be solved via the existing interfaces. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:23 AM, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:29 PM, Anthony Liguori wrote: existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Migrate-to-file with block live migration, or even better, something based on Kemari would be a lot faster. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount 1. I think you quickly end up reinventing btrfs though in the process. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:33 PM, Daniel P. Berrange wrote: On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. IMHO dm-crypt isn't a generally usable alternative to native built in encryption in qcow2. It isn't usable at all by non-root. If you want to use with plain files, then you need to turn the file into a loopback device and then layer in dm-crypt. It is generally just a PITA to manage. I wasn't suggesting dm-crypt is a replacement for qcow2 encyption, just that it shows that block-level encryption can be done with reasonable overhead. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:36 AM, Avi Kivity wrote: On 02/23/2011 05:29 PM, Anthony Liguori wrote: existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Fabrice wanted to use CoW to as a mechanism to deduplicate the memory contents with the on-disk state specifically to address this problem. For the longest time, there was a comment in the savevm code along these lines. It might still be there. I think the lack of on-disk hashes was a critical missing bit to make this feature really work well. Migrate-to-file with block live migration, or even better, something based on Kemari would be a lot faster. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount 1. I think you quickly end up reinventing btrfs though in the process. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? Lots of things. The array will start to consume quite a bit of contiguous space as it gets larger which means it needs to be relocated. Deleting a snapshot is a far more expensive operation than it needs to be. Regards, Anthony Liguori