Re: [Qemu-devel] [PULL v2 46/49] qdev: Add enum property types to QAPI schema
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Il 17/02/2014 23:30, Eric Blake ha scritto: | Oops - we lost the ## header before add_client. Sounds like a | followup patch might be called for, unless you are okay tweaking | the pull request. Yeah, I can do that with qemu-trivial. Paolo -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.22 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJTAxRgAAoJEBvWZb6bTYby+0kP/jo/84UmFlN6zJssfed++kkm tffNBa2Aom2oWiPiUv80J8sstiL7QRI44szHrnj9IVew8q/iSuIjHKQSvG9NvdE+ zEB1XKqcm4dlXOpVZ+Jq4w9g2K7BLmtoEJY75oZEE/u8731EI1IMVqmzGJ1dBYsh /XDS9drIsHfZvDNcqTtRdY7ZhFP4QZo/UrDcX4vHx8ak6QcGWllfTn8sUGmkV3rS p643v54WCwE+ZhTtfWEHwhTU7c0tUZzr+UxGXZ6bumJ7CXCdgeZltYkqxUc/qOlb Fg9HEZiFaDxGDtI2IzJu4G2Y67j7twM8h/USSkx1lUqaQNQ2wC7nLDZ3GsnkY8XY 450DfVKfBAN7bAFf8jVT+h2D/AgfX1qYsKrozgOGppxQRmIhU0QlPFm/1/V9paCX 0Nje6oG+89XMUl3KKstyXzKiKGQSnETu2iyVMpeLcQyDUZWdst0c7yh4yME3giIH sS8354qcdOs2VYE02z7ceRU7M3vwv6sfuGyrN15oeFI/x42CTCseH2B8Unvn815F Da/hHNqNy58PTgVbMybz/sxYRBKBl8euTvQuxjLyg+2SvOJUtvK040lRuVk9YwTO Q0RmnVU5bXkO8A27l34zccnekLAbIJVgJKLVzB9tpGdONNLsUuXXwgtfP2Kw+IVI I0fyhNi64xs6LCER/QI6 =FzxL -END PGP SIGNATURE-
Re: [Qemu-devel] [PATCH v20 18/26] sheepdog.c: replace QEMUOptionParameter with QemuOpts
2014-02-18 3:01 GMT+08:00 Eric Blake ebl...@redhat.com: On 02/11/2014 11:33 PM, Chunyan Liu wrote: sheepdog.c: replace QEMUOptionParameter with QemuOpts Signed-off-by: Dong Xu Wang wdon...@linux.vnet.ibm.com Signed-off-by: Chunyan Liu cy...@suse.com --- block/sheepdog.c | 101 +- 1 files changed, 47 insertions(+), 54 deletions(-) + +buf = NULL; +buf = qemu_opt_get_del(opts, BLOCK_OPT_REDUNDANCY); Dead NULL assignment. +if (buf) { +ret = parse_redundancy(s, buf); +if (ret 0) { +goto out; } Do you need to store into ret here, or is it sufficient to use 'if (parse_redundancy(s, buf) 0) {' It followed the logic in existing code. If function ret 0, return with that return value. Same logic in earlier codes of the function. if (strstr(filename, ://)) { ret = sd_parse_uri(s, filename, s-name, snapid, tag); } else { ret = parse_vdiname(s, filename, s-name, snapid, tag); } if (ret 0) { goto out; } -{ -.name = BLOCK_OPT_REDUNDANCY, -.type = OPT_STRING, -.help = Redundancy of the image -}, -{ NULL } +static QemuOptsList sd_create_opts = { +.name = sheepdog-create-opts, +.head = QTAILQ_HEAD_INITIALIZER(sd_create_opts.head), +.desc = { +{ +.name = BLOCK_OPT_PREALLOC, +.type = QEMU_OPT_STRING, +.help = Preallocation mode (allowed values: off, full) +}, +{ /* end of list */ } Missing redundancy in the new list. @@ -2538,7 +2533,7 @@ static BlockDriver bdrv_sheepdog = { .bdrv_save_vmstate = sd_save_vmstate, .bdrv_load_vmstate = sd_load_vmstate, -.create_options = sd_create_options, +.create_opts = sd_create_opts, Another example of the inconsistent use of . }; static BlockDriver bdrv_sheepdog_tcp = { @@ -2548,7 +2543,7 @@ static BlockDriver bdrv_sheepdog_tcp = { .bdrv_needs_filename = true, .bdrv_file_open = sd_open, .bdrv_close = sd_close, -.bdrv_create= sd_create, +.bdrv_create2 = sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, .bdrv_getlength = sd_getlength, .bdrv_get_allocated_file_size = sd_get_allocated_file_size, @@ -2568,7 +2563,6 @@ static BlockDriver bdrv_sheepdog_tcp = { .bdrv_save_vmstate = sd_save_vmstate, .bdrv_load_vmstate = sd_load_vmstate, -.create_options = sd_create_options, }; Why no .create_opts replacement? -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org
[Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing
From: Michael R. Hines mrhi...@us.ibm.com Changes since v1: 1. Re-based against Juan's improved migration_bitmap performance changes 2. Overhauled RDMA support to prepare for better usage of RDMA in other parts of the QEMU code base (such as storage). 3. Fix for netlink issues that failed to cleanup the network buffer device for development testing. Michael R. Hines (12): mc: add documentation for micro-checkpointing mc: timestamp migration_bitmap and KVM logdirty usage mc: introduce a 'checkpointing' status check into the VCPU states mc: support custom page loading and copying rdma: accelerated memcpy() support and better external RDMA user interfaces mc: introduce state machine changes for MC mc: introduce additional QMP statistics for micro-checkpointing mc: core logic mc: configure and makefile support mc: expose tunable parameter for checkpointing frequency mc: introduce new capabilities to control micro-checkpointing mc: activate and use MC if requested Makefile.objs |1 + arch_init.c | 72 +- configure | 45 + cpus.c|9 +- docs/mc.txt | 222 hmp-commands.hx | 16 +- hmp.c | 23 + hmp.h |1 + include/migration/migration.h | 70 +- include/migration/qemu-file.h | 55 +- migration-checkpoint.c| 1565 + migration-rdma.c | 2605 +++-- migration.c | 156 ++- qapi-schema.json | 86 +- qemu-file.c | 80 +- qmp-commands.hx | 23 + vl.c |9 + 17 files changed, 4097 insertions(+), 941 deletions(-) create mode 100644 docs/mc.txt create mode 100644 migration-checkpoint.c -- 1.8.1.2
[Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
From: Michael R. Hines mrhi...@us.ibm.com We also later export these statistics over QMP for better monitoring of micro-checkpointing as the workload changes. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- arch_init.c | 34 -- 1 file changed, 28 insertions(+), 6 deletions(-) diff --git a/arch_init.c b/arch_init.c index 80574a0..b8364b0 100644 --- a/arch_init.c +++ b/arch_init.c @@ -193,6 +193,8 @@ typedef struct AccountingInfo { uint64_t skipped_pages; uint64_t norm_pages; uint64_t iterations; +uint64_t log_dirty_time; +uint64_t migration_bitmap_time; uint64_t xbzrle_bytes; uint64_t xbzrle_pages; uint64_t xbzrle_cache_miss; @@ -201,7 +203,7 @@ typedef struct AccountingInfo { static AccountingInfo acct_info; -static void acct_clear(void) +void acct_clear(void) { memset(acct_info, 0, sizeof(acct_info)); } @@ -236,6 +238,16 @@ uint64_t norm_mig_pages_transferred(void) return acct_info.norm_pages; } +uint64_t norm_mig_log_dirty_time(void) +{ +return acct_info.log_dirty_time; +} + +uint64_t norm_mig_bitmap_time(void) +{ +return acct_info.migration_bitmap_time; +} + uint64_t xbzrle_mig_bytes_transferred(void) { return acct_info.xbzrle_bytes; @@ -426,27 +438,35 @@ static void migration_bitmap_sync(void) static int64_t num_dirty_pages_period; int64_t end_time; int64_t bytes_xfer_now; +int64_t begin_time; +int64_t dirty_time; if (!bytes_xfer_prev) { bytes_xfer_prev = ram_bytes_transferred(); } +begin_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); if (!start_time) { start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); } - trace_migration_bitmap_sync_start(); address_space_sync_dirty_bitmap(address_space_memory); +dirty_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); + QTAILQ_FOREACH(block, ram_list.blocks, next) { migration_bitmap_sync_range(block-mr-ram_addr, block-length); } + trace_migration_bitmap_sync_end(migration_dirty_pages - num_dirty_pages_init); num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init; end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); -/* more than 1 second = 1000 millisecons */ +acct_info.log_dirty_time += dirty_time - begin_time; +acct_info.migration_bitmap_time += end_time - dirty_time; + +/* more than 1 second = 1000 milliseconds */ if (end_time start_time + 1000) { if (migrate_auto_converge()) { /* The following detection logic can be refined later. For now: @@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage) /* XBZRLE overflow or normal page */ if (bytes_sent == -1) { bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE); -qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE); -bytes_sent += TARGET_PAGE_SIZE; -acct_info.norm_pages++; +if (ret != RAM_SAVE_CONTROL_DELAYED) { +qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE); +bytes_sent += TARGET_PAGE_SIZE; +acct_info.norm_pages++; +} } /* if page is unmodified, continue to the next */ -- 1.8.1.2
[Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
From: Michael R. Hines mrhi...@us.ibm.com MC provides a lot of new information, including the same RAM statistics that ordinary migration does, so we centralize a lot of that printing code into a common function so that the QMP printing statements don't get duplicated too much. We also introduce a new MCStats structure (like MigrationStats) due to the large number of non-migration related statistics - don't want to confuse migration and MC too much, so let's keep them separate for now. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- hmp.c | 17 + include/migration/migration.h | 6 +++ migration.c | 86 ++- qapi-schema.json | 33 + 4 files changed, 109 insertions(+), 33 deletions(-) diff --git a/hmp.c b/hmp.c index 1af0809..edf062e 100644 --- a/hmp.c +++ b/hmp.c @@ -203,6 +203,23 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) info-disk-total 10); } +if (info-has_mc) { +monitor_printf(mon, checkpoints: % PRIu64 \n, + info-mc-checkpoints); +monitor_printf(mon, xmit_time: % PRIu64 ms\n, + info-mc-xmit_time); +monitor_printf(mon, log_dirty_time: % PRIu64 ms\n, + info-mc-log_dirty_time); +monitor_printf(mon, migration_bitmap_time: % PRIu64 ms\n, + info-mc-migration_bitmap_time); +monitor_printf(mon, ram_copy_time: % PRIu64 ms\n, + info-mc-ram_copy_time); +monitor_printf(mon, copy_mbps: %0.2f mbps\n, + info-mc-copy_mbps); +monitor_printf(mon, throughput: %0.2f mbps\n, + info-mc-mbps); +} + if (info-has_xbzrle_cache) { monitor_printf(mon, cache size: % PRIu64 bytes\n, info-xbzrle_cache-cache_size); diff --git a/include/migration/migration.h b/include/migration/migration.h index e876a2c..f18ff5e 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -53,14 +53,20 @@ struct MigrationState int state; MigrationParams params; double mbps; +double copy_mbps; int64_t total_time; int64_t downtime; int64_t expected_downtime; +int64_t xmit_time; +int64_t ram_copy_time; +int64_t log_dirty_time; +int64_t bitmap_time; int64_t dirty_pages_rate; int64_t dirty_bytes_rate; bool enabled_capabilities[MIGRATION_CAPABILITY_MAX]; int64_t xbzrle_cache_size; int64_t setup_time; +int64_t checkpoints; }; void process_incoming_migration(QEMUFile *f); diff --git a/migration.c b/migration.c index f42dae4..0ccbeaa 100644 --- a/migration.c +++ b/migration.c @@ -59,7 +59,6 @@ MigrationState *migrate_get_current(void) .state = MIG_STATE_NONE, .bandwidth_limit = MAX_THROTTLE, .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE, -.mbps = -1, }; return current_migration; @@ -173,6 +172,31 @@ static void get_xbzrle_cache_stats(MigrationInfo *info) } } +static void get_ram_stats(MigrationState *s, MigrationInfo *info) +{ +info-has_total_time = true; +info-total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +- s-total_time; + +info-has_ram = true; +info-ram = g_malloc0(sizeof(*info-ram)); +info-ram-transferred = ram_bytes_transferred(); +info-ram-total = ram_bytes_total(); +info-ram-duplicate = dup_mig_pages_transferred(); +info-ram-skipped = skipped_mig_pages_transferred(); +info-ram-normal = norm_mig_pages_transferred(); +info-ram-normal_bytes = norm_mig_bytes_transferred(); +info-ram-mbps = s-mbps; + +if (blk_mig_active()) { +info-has_disk = true; +info-disk = g_malloc0(sizeof(*info-disk)); +info-disk-transferred = blk_mig_bytes_transferred(); +info-disk-remaining = blk_mig_bytes_remaining(); +info-disk-total = blk_mig_bytes_total(); +} +} + MigrationInfo *qmp_query_migrate(Error **errp) { MigrationInfo *info = g_malloc0(sizeof(*info)); @@ -199,26 +223,8 @@ MigrationInfo *qmp_query_migrate(Error **errp) info-has_setup_time = true; info-setup_time = s-setup_time; -info-has_ram = true; -info-ram = g_malloc0(sizeof(*info-ram)); -info-ram-transferred = ram_bytes_transferred(); -info-ram-remaining = ram_bytes_remaining(); -info-ram-total = ram_bytes_total(); -info-ram-duplicate = dup_mig_pages_transferred(); -info-ram-skipped = skipped_mig_pages_transferred(); -info-ram-normal = norm_mig_pages_transferred(); -info-ram-normal_bytes = norm_mig_bytes_transferred(); +get_ram_stats(s, info); info-ram-dirty_pages_rate = s-dirty_pages_rate; -info-ram-mbps = s-mbps; - -if (blk_mig_active()) { -info-has_disk = true; -
[Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
From: Michael R. Hines mrhi...@us.ibm.com Wiki: http://wiki.qemu.org/Features/MicroCheckpointing Github: g...@github.com:hinesmr/qemu.git, 'mc' branch NOTE: This is a direct copy of the QEMU wiki page for the convenience of the review process. Since this series very much in flux, instead of maintaing two copies of documentation in two different formats, this documentation will be properly formatted in the future when the review process has completed. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- docs/mc.txt | 222 1 file changed, 222 insertions(+) create mode 100644 docs/mc.txt diff --git a/docs/mc.txt b/docs/mc.txt new file mode 100644 index 000..5d4b5fe --- /dev/null +++ b/docs/mc.txt @@ -0,0 +1,222 @@ +Micro Checkpointing Specification v1 +== +Wiki: http://wiki.qemu.org/Features/MicroCheckpointing +Github: g...@github.com:hinesmr/qemu.git, 'mc' branch + +Copyright (C) 2014 Michael R. Hines mrhi...@us.ibm.com + +Contents +1 Summary +1.1 Contact +1.2 Introduction +2 The Micro-Checkpointing Process +2.1 Basic Algorithm +2.2 I/O buffering +2.3 Failure Recovery +3 Optimizations +3.1 Memory Management +3.2 RDMA Integration +4 Usage +4.1 BEFORE Running +4.2 Running +5 Performance +6 TODO +7 FAQ / Frequently Asked Questions +7.1 What happens if a failure occurs in the *middle* of a flush of the network buffer? +7.2 What's different about this implementation? +Summary +This is an implementation of Micro Checkpointing for memory and cpu state. Also known as: Continuous Replication or Fault Tolerance or 100 other different names - choose your poison. + +Contact +Name: Michael Hines +Email: mrhi...@us.ibm.com +Wiki: http://wiki.qemu.org/Features/MicroCheckpointing + +Github: http://github.com/hinesmr/qemu.git, 'mc' branch + +Libvirt Support: http://github.com/hinesmr/libvirt.git, 'mc' branch + +Copyright (C) 2014 IBM Michael R. Hines mrhi...@us.ibm.com + +Introduction +Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a running virtual machine (VM) with little or no runtime assistance from the guest kernel or guest application software. Furthermore, Fault Tolerance is one method of providing high availability to a VM such that, from the perspective of the outside world (clients, devices, and neighboring VMs that may be paired with it), the VM and its applications have not lost any runtime state in the event of either a failure of the hypervisor/hardware to allow the VM to make forward progress or a complete loss of power. This mechanism for providing fault tolerance does *not* provide any protection whatsoever against software-level faults in the guest kernel or applications. In fact, due to the potentially extended lifetime of the VM because of this type of high availability, such software-level bugs may in fact manifest themselves more often than they otherwise ordinarily would, in which case you would need to empl! oy other + +This implementation is also fully compatible with RDMA and has undergone special optimizations to suppor the use of RDMA. (See docs/rdma.txt for more details). + +The Micro-Checkpointing Process +Basic Algorithm +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a live migration that never ends. As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps: + +1. After N milliseconds, stop the VM. +3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU. +4. Resume the VM immediately so that it can make forward progress. +5. Transmit the checkpoint to the destination. +6. Repeat +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally. + +I/O buffering +Additionally, a MC must include a consistent view of device I/O, particularly the network, a problem commonly referred to as output commit. This means that the outside world can not be allowed to experience duplicate state that was committed by the virtual machine after failure. This is possible because a checkpoint may diverge by N milliseconds of time and commit state while the current MC is being transmitted to the destination. + +To guard against this problem, first, we must buffer the TX output of the network (not the input) between MCs until the current MC is safely received by the destination. For example, all outbound network packets must be held at the source until the MC is transmitted. After transmission is complete, those packets can be released. Similarly, in the case of disk I/O, we must ensure that either the contents of the local disk are safely mirrored to a remote disk before completing a MC or that the output to a shared disk, such as iSCSI, is also buffered between checkpoints and then later released in the
[Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states
From: Michael R. Hines mrhi...@us.ibm.com During micro-checkpointing, the VCPUs get repeatedly paused and resumed. We need to not freak out when the VM begins micro-checkpointing. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- cpus.c| 9 - include/migration/migration.h | 21 + qapi-schema.json | 4 +++- vl.c | 7 +++ 4 files changed, 39 insertions(+), 2 deletions(-) diff --git a/cpus.c b/cpus.c index 945d85b..b876d6b 100644 --- a/cpus.c +++ b/cpus.c @@ -532,7 +532,14 @@ static int do_vm_stop(RunState state) pause_all_vcpus(); runstate_set(state); vm_state_notify(0, state); -monitor_protocol_event(QEVENT_STOP, NULL); +/* + * If MC is enabled, libvirt gets confused + * because it thinks the VM is stopped when + * its just being micro-checkpointed. + */ +if(state != RUN_STATE_CHECKPOINT_VM) { +monitor_protocol_event(QEVENT_STOP, NULL); +} } bdrv_drain_all(); diff --git a/include/migration/migration.h b/include/migration/migration.h index 3e1e6c7..9c62e2f 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -121,10 +121,31 @@ uint64_t skipped_mig_bytes_transferred(void); uint64_t skipped_mig_pages_transferred(void); uint64_t norm_mig_bytes_transferred(void); uint64_t norm_mig_pages_transferred(void); +uint64_t norm_mig_log_dirty_time(void); +uint64_t norm_mig_bitmap_time(void); uint64_t xbzrle_mig_bytes_transferred(void); uint64_t xbzrle_mig_pages_transferred(void); uint64_t xbzrle_mig_pages_overflow(void); uint64_t xbzrle_mig_pages_cache_miss(void); +void acct_clear(void); + +void migrate_set_state(MigrationState *s, int old_state, int new_state); + +enum { +MIG_STATE_ERROR = -1, +MIG_STATE_NONE, +MIG_STATE_SETUP, +MIG_STATE_CANCELLED, +MIG_STATE_CANCELLING, +MIG_STATE_ACTIVE, +MIG_STATE_CHECKPOINTING, +MIG_STATE_COMPLETED, +}; + +int mc_enable_buffering(void); +int mc_start_buffer(void); +void mc_init_checkpointer(MigrationState *s); +void mc_process_incoming_checkpoints_if_requested(QEMUFile *f); void ram_handle_compressed(void *host, uint8_t ch, uint64_t size); diff --git a/qapi-schema.json b/qapi-schema.json index 7cfb5e5..3c2ee4d 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -169,6 +169,8 @@ # # @save-vm: guest is paused to save the VM state # +# @checkpoint-vm: guest is paused to checkpoint the VM state +# # @shutdown: guest is shut down (and -no-shutdown is in use) # # @suspended: guest is suspended (ACPI S3) @@ -181,7 +183,7 @@ 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', -'guest-panicked' ] } +'guest-panicked', 'checkpoint-vm' ] } ## # @SnapshotInfo diff --git a/vl.c b/vl.c index 316de54..2fb5b1f 100644 --- a/vl.c +++ b/vl.c @@ -552,6 +552,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_PAUSED, RUN_STATE_RUNNING }, { RUN_STATE_PAUSED, RUN_STATE_FINISH_MIGRATE }, +{ RUN_STATE_PAUSED, RUN_STATE_CHECKPOINT_VM }, { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING }, { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE }, @@ -562,14 +563,18 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING }, { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE }, +{ RUN_STATE_FINISH_MIGRATE, RUN_STATE_CHECKPOINT_VM }, { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING }, +{ RUN_STATE_CHECKPOINT_VM, RUN_STATE_RUNNING }, + { RUN_STATE_RUNNING, RUN_STATE_DEBUG }, { RUN_STATE_RUNNING, RUN_STATE_INTERNAL_ERROR }, { RUN_STATE_RUNNING, RUN_STATE_IO_ERROR }, { RUN_STATE_RUNNING, RUN_STATE_PAUSED }, { RUN_STATE_RUNNING, RUN_STATE_FINISH_MIGRATE }, +{ RUN_STATE_RUNNING, RUN_STATE_CHECKPOINT_VM }, { RUN_STATE_RUNNING, RUN_STATE_RESTORE_VM }, { RUN_STATE_RUNNING, RUN_STATE_SAVE_VM }, { RUN_STATE_RUNNING, RUN_STATE_SHUTDOWN }, @@ -585,9 +590,11 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_RUNNING, RUN_STATE_SUSPENDED }, { RUN_STATE_SUSPENDED, RUN_STATE_RUNNING }, { RUN_STATE_SUSPENDED, RUN_STATE_FINISH_MIGRATE }, +{ RUN_STATE_SUSPENDED, RUN_STATE_CHECKPOINT_VM }, { RUN_STATE_WATCHDOG, RUN_STATE_RUNNING }, { RUN_STATE_WATCHDOG, RUN_STATE_FINISH_MIGRATE }, +{ RUN_STATE_WATCHDOG, RUN_STATE_CHECKPOINT_VM }, { RUN_STATE_GUEST_PANICKED, RUN_STATE_RUNNING }, { RUN_STATE_GUEST_PANICKED, RUN_STATE_FINISH_MIGRATE }, -- 1.8.1.2
[Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
From: Michael R. Hines mrhi...@us.ibm.com New capabilities include the use of RDMA acceleration, use of network buffering, and keepalive support, as documented in patch #1. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- qapi-schema.json | 36 +++- 1 file changed, 35 insertions(+), 1 deletion(-) diff --git a/qapi-schema.json b/qapi-schema.json index 98abdac..1fdf208 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -720,10 +720,44 @@ # @auto-converge: If enabled, QEMU will automatically throttle down the guest # to speed up convergence of RAM migration. (since 1.6) # +# @mc: The migration will never end, and the VM will instead be continuously +# micro-checkpointed (MC). Use the command migrate-set-mc-delay to +# control the frequency at which the checkpoints occur. +# Disabled by default. (Since 2.x) +# +# @mc-net-disable: Deactivate network buffering against outbound network +# traffic while Micro-Checkpointing (@mc) is active. +# Enabled by default. Disabling will make the MC protocol inconsistent +# and potentially break network connections upon an actual failure. +# Only for performance testing. (Since 2.x) +# +# @mc-rdma-copy: MC requires creating a local-memory checkpoint before +# transmission to the destination. This requires heavy use of +# memcpy() which dominates the processor pipeline. This option +# makes use of *local* RDMA to perform the copy instead of the CPU. +# Enabled by default only if the migration transport is RDMA. +# Disabled by default otherwise. (Since 2.x) +# +# @rdma-keepalive: RDMA connections do not timeout by themselves if a peer +# has disconnected prematurely or failed. User-level keepalives +# allow the migration to abort cleanly if there is a problem with the +# destination host. For debugging, this can be problematic as +# the keepalive may cause the peer to abort prematurely if we are +# at a GDB breakpoint, for example. +# Enabled by default. (Since 2.x) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', - 'data': ['xbzrle', 'x-rdma-pin-all', 'auto-converge', 'zero-blocks'] } + 'data': ['xbzrle', + 'rdma-pin-all', + 'auto-converge', + 'zero-blocks', + 'mc', + 'mc-net-disable', + 'mc-rdma-copy', + 'rdma-keepalive' + ] } ## # @MigrationCapabilityStatus -- 1.8.1.2
[Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
From: Michael R. Hines mrhi...@us.ibm.com This implements the core logic, all described in the first patch (docs/mc.txt). Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- migration-checkpoint.c | 1565 1 file changed, 1565 insertions(+) create mode 100644 migration-checkpoint.c diff --git a/migration-checkpoint.c b/migration-checkpoint.c new file mode 100644 index 000..a69edb2 --- /dev/null +++ b/migration-checkpoint.c @@ -0,0 +1,1565 @@ +/* + * Micro-Checkpointing (MC) support + * (a.k.a. Fault Tolerance or Continuous Replication) + * + * Copyright IBM, Corp. 2014 + * + * Authors: + * Michael R. Hines mrhi...@us.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or + * later. See the COPYING file in the top-level directory. + * + */ +#include libnl3/netlink/route/qdisc/plug.h +#include libnl3/netlink/route/class.h +#include libnl3/netlink/cli/utils.h +#include libnl3/netlink/cli/tc.h +#include libnl3/netlink/cli/qdisc.h +#include libnl3/netlink/cli/link.h +#include qemu-common.h +#include hw/virtio/virtio.h +#include hw/virtio/virtio-net.h +#include qemu/sockets.h +#include migration/migration.h +#include migration/qemu-file.h +#include qmp-commands.h +#include net/tap-linux.h +#include sys/ioctl.h + +#define DEBUG_MC +//#define DEBUG_MC_VERBOSE +//#define DEBUG_MC_REALLY_VERBOSE + +#ifdef DEBUG_MC +#define DPRINTF(fmt, ...) \ +do { printf(mc: fmt, ## __VA_ARGS__); } while (0) +#else +#define DPRINTF(fmt, ...) \ +do { } while (0) +#endif + +#ifdef DEBUG_MC_VERBOSE +#define DDPRINTF(fmt, ...) \ +do { printf(mc: fmt, ## __VA_ARGS__); } while (0) +#else +#define DDPRINTF(fmt, ...) \ +do { } while (0) +#endif + +#ifdef DEBUG_MC_REALLY_VERBOSE +#define DDDPRINTF(fmt, ...) \ +do { printf(mc: fmt, ## __VA_ARGS__); } while (0) +#else +#define DDDPRINTF(fmt, ...) \ +do { } while (0) +#endif + +/* + * Micro checkpoints (MC)s are typically only a few MB when idle. + * However, they can easily be very large during heavy workloads. + * In the *extreme* worst-case, QEMU might need double the amount of main memory + * than that of what was originally allocated to the virtual machine. + * + * To support this variability during transient periods, a MC + * consists of a linked list of slabs, each of identical size. A better name + * would be welcome, as the name was only chosen because it resembles linux + * memory allocation. Because MCs occur several times per second + * (a frequency of 10s of milliseconds), slabs allow MCs to grow and shrink + * without constantly re-allocating all memory in place during each checkpoint. + * + * During steady-state, the 'head' slab is permanently allocated and never goes + * away, so when the VM is idle, there is no memory allocation at all. + * This design supports the use of RDMA. Since RDMA requires memory pinning, we + * must be able to hold on to a slab for a reasonable amount of time to get any + * real use out of it. + * + * Regardless, the current strategy taken is: + * + * 1. If the checkpoint size increases, + *then grow the number of slabs to support it, + *(if and only if RDMA is activated, these slabs will be pinned.) + * 2. If the next checkpoint size is smaller than the last one, + then that's a strike. + * 3. After N strikes, cut the size of the slab cache in half + *(to a minimum of 1 slab as described before). + * + * As of this writing, a typical average size of + * an Idle-VM checkpoint is under 5MB. + */ + +#define MC_SLAB_BUFFER_SIZE (5UL * 1024UL * 1024UL) /* empirical */ +#define MC_DEV_NAME_MAX_SIZE256 + +#define MC_DEFAULT_CHECKPOINT_FREQ_MS 100 /* too slow, but best for now */ +#define CALC_MAX_STRIKES() \ +do { max_strikes = (max_strikes_delay_secs * 1000) / freq_ms; } \ +while (0) + +/* + * How many seconds-worth of checkpoints to wait before re-evaluating the size + * of the slab list? + * + * #strikes_until_shrink_cache = Function(#checkpoints/sec) + * + * Increasing the number of seconds also increases the number of strikes needed + * to be reached until it is time to cut the cache in half. + * + * Below value is open for debate - we just want it to be small enough to ensure + * that a large, idle slab list doesn't stay too large for too long. + */ +#define MC_DEFAULT_SLAB_MAX_CHECK_DELAY_SECS 10 + +/* + * MC serializes the actual RAM page contents in such a way that the actual + * pages are separated from the meta-data (all the QEMUFile stuff). + * + * This is done strictly for the purposes of being able to use RDMA + * and to replace memcpy() on the local machine for hardware with very + * fast RAM memory speeds. + * + * This serialization requires recording the page descriptions and then + * pushing them into slabs after the checkpoint has been captured + * (minus the page data). + * + * The memory holding the page descriptions are allocated
[Qemu-devel] [RFC PATCH v2 04/12] mc: support custom page loading and copying
From: Michael R. Hines mrhi...@us.ibm.com Just as RDMA has custom routines for saving memory, this provides RDMA with custom routines for loading and copying memory as well. Micro-checkpointing needs this support to avoid modifying the arch_init.c as little as possible while stilling being able to load RDMA-based memory from checkpoints in a performance-optimal way as they are received from the network. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- arch_init.c | 9 +++-- include/migration/migration.h | 33 -- include/migration/qemu-file.h | 54 +++-- qemu-file.c | 80 +-- 4 files changed, 167 insertions(+), 9 deletions(-) diff --git a/arch_init.c b/arch_init.c index b8364b0..db75120 100644 --- a/arch_init.c +++ b/arch_init.c @@ -540,7 +540,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage) /* In doubt sent page as normal */ bytes_sent = -1; ret = ram_control_save_page(f, block-offset, - offset, TARGET_PAGE_SIZE, bytes_sent); + block-host, offset, TARGET_PAGE_SIZE, bytes_sent); if (ret != RAM_SAVE_CONTROL_NOT_SUPP) { if (ret != RAM_SAVE_CONTROL_DELAYED) { @@ -1004,13 +1004,18 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) ram_handle_compressed(host, ch, TARGET_PAGE_SIZE); } else if (flags RAM_SAVE_FLAG_PAGE) { void *host; +int r; host = host_from_stream_offset(f, addr, flags); if (!host) { return -EINVAL; } -qemu_get_buffer(f, host, TARGET_PAGE_SIZE); +r = ram_control_load_page(f, host, TARGET_PAGE_SIZE); + +if (r == RAM_LOAD_CONTROL_NOT_SUPP) { +qemu_get_buffer(f, host, TARGET_PAGE_SIZE); +} } else if (flags RAM_SAVE_FLAG_XBZRLE) { void *host = host_from_stream_offset(f, addr, flags); if (!host) { diff --git a/include/migration/migration.h b/include/migration/migration.h index 9c62e2f..5c1a574 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -190,9 +190,38 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags); #define RAM_SAVE_CONTROL_NOT_SUPP -1000 #define RAM_SAVE_CONTROL_DELAYED -2000 +#define RAM_LOAD_CONTROL_NOT_SUPP -3000 +#define RAM_LOAD_CONTROL_DELAYED -4000 +#define RAM_COPY_CONTROL_NOT_SUPP -5000 +#define RAM_COPY_CONTROL_DELAYED -6000 -size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset, - ram_addr_t offset, size_t size, +#define RDMA_CONTROL_VERSION_CURRENT 1 + +int ram_control_save_page(QEMUFile *f, ram_addr_t block_offset, + uint8_t *host_addr, + ram_addr_t offset, long size, int *bytes_sent); +int ram_control_load_page(QEMUFile *f, + void *host_addr, + long size); + +int ram_control_copy_page(QEMUFile *f, + ram_addr_t block_offset_dest, + ram_addr_t offset_dest, + ram_addr_t block_offset_source, + ram_addr_t offset_source, + long size); + +int migrate_use_mc(void); +int migrate_use_mc_net(void); +int migrate_use_mc_rdma_copy(void); + +#define MC_VERSION 1 + +int mc_info_load(QEMUFile *f, void *opaque, int version_id); +void mc_info_save(QEMUFile *f, void *opaque); + +void qemu_rdma_info_save(QEMUFile *f, void *opaque); +int qemu_rdma_info_load(QEMUFile *f, void *opaque, int version_id); #endif diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h index a191fb6..c50de0d 100644 --- a/include/migration/qemu-file.h +++ b/include/migration/qemu-file.h @@ -71,17 +71,63 @@ typedef int (QEMURamHookFunc)(QEMUFile *f, void *opaque, uint64_t flags); #define RAM_CONTROL_ROUND1 #define RAM_CONTROL_HOOK 2 #define RAM_CONTROL_FINISH 3 +#define RAM_CONTROL_FLUSH4 /* * This function allows override of where the RAM page * is saved (such as RDMA, for example.) */ -typedef size_t (QEMURamSaveFunc)(QEMUFile *f, void *opaque, +typedef int (QEMURamSaveFunc)(QEMUFile *f, void *opaque, ram_addr_t block_offset, + uint8_t *host_addr, ram_addr_t offset, - size_t size, + long size, int *bytes_sent); +/* + * This function allows override of where the RAM page + * is saved (such as RDMA, for example.) + */ +typedef int (QEMURamLoadFunc)(QEMUFile *f, + void *opaque, + void
[Qemu-devel] [RFC PATCH v2 09/12] mc: configure and makefile support
From: Michael R. Hines mrhi...@us.ibm.com Self-explanatory. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- Makefile.objs | 1 + configure | 45 + 2 files changed, 46 insertions(+) diff --git a/Makefile.objs b/Makefile.objs index ac1d0e1..db70f93 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -53,6 +53,7 @@ common-obj-y += migration.o migration-tcp.o common-obj-y += vmstate.o common-obj-y += qemu-file.o common-obj-$(CONFIG_RDMA) += migration-rdma.o +common-obj-$(CONFIG_MC) += migration-checkpoint.o common-obj-y += qemu-char.o #aio.o common-obj-y += block-migration.o common-obj-y += page_cache.o xbzrle.o diff --git a/configure b/configure index 0eadab5..507ee99 100755 --- a/configure +++ b/configure @@ -197,6 +197,7 @@ kvm=no rdma= gprof=no debug_tcg=no +mc= debug=no strip_opt=yes tcg_interpreter=no @@ -1001,6 +1002,10 @@ for opt do ;; --enable-libssh2) libssh2=yes ;; + --disable-mc) mc=no + ;; + --enable-mc) mc=yes + ;; --enable-vhdx) vhdx=yes ;; --disable-vhdx) vhdx=no @@ -1189,6 +1194,8 @@ Advanced options (experts only): --enable-kvm enable KVM acceleration support --disable-rdma disable RDMA-based migration support --enable-rdmaenable RDMA-based migration support + --disable-mc disable Micro-Checkpointing support + --enable-mc enable Micro-Checkpointing support --enable-tcg-interpreter enable TCG with bytecode interpreter (TCI) --enable-system enable all system emulation targets --disable-system disable all system emulation targets @@ -1901,6 +1908,35 @@ EOF fi fi +## +# Micro-Checkpointing requires netlink (libnl3) +if test $mc != no ; then + cat $TMPC EOF +#include libnl3/netlink/route/qdisc/plug.h +#include libnl3/netlink/route/class.h +#include libnl3/netlink/cli/utils.h +#include libnl3/netlink/cli/tc.h +#include libnl3/netlink/cli/qdisc.h +#include libnl3/netlink/cli/link.h +int main(void) { return 0; } +EOF + mc_libs=-lnl-3 -lnl-cli-3 -lnl-route-3 + mc_cflags=-I/usr/include/libnl3 + if compile_prog $mc_cflags $mc_libs ; then +mc=yes +libs_softmmu=$libs_softmmu $mc_libs +QEMU_CFLAGS=$QEMU_CFLAGS $mc_cflags + else +if test $mc = yes ; then +error_exit \ + NetLink v3 libs/headers not present. \ + Please install the libnl3-*-dev(el) packages from your distro. +fi +mc=no + fi +fi + + ## # VNC TLS/WS detection if test $vnc = yes -a \( $vnc_tls != no -o $vnc_ws != no \) ; then @@ -3829,6 +3865,7 @@ echo KVM support $kvm echo RDMA support $rdma echo TCG interpreter $tcg_interpreter echo fdt support $fdt +echo Micro checkpointing $mc echo preadv support$preadv echo fdatasync $fdatasync echo madvise $madvise @@ -4334,6 +4371,10 @@ if test $rdma = yes ; then echo CONFIG_RDMA=y $config_host_mak fi +if test $mc = yes ; then + echo CONFIG_MC=y $config_host_mak +fi + if test $tcg_interpreter = yes; then QEMU_INCLUDES=-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES elif test $ARCH = sparc64 ; then @@ -4769,6 +4810,10 @@ echo QEMU_CFLAGS+=$cflags $config_target_mak done # for target in $targets +if test $mc = yes ; then +echo CONFIG_MC=y $config_host_mak +fi + if [ $pixman = internal ]; then echo config-host.h: subdir-pixman $config_host_mak fi -- 1.8.1.2
[Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
From: Michael R. Hines mrhi...@us.ibm.com This patch sets up the initial changes to the migration state machine and prototypes to be used by the checkpointing code to interact with the state machine so that we can later handle failure and recovery scenarios. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- arch_init.c | 29 - include/migration/migration.h | 2 ++ migration.c | 37 + 3 files changed, 47 insertions(+), 21 deletions(-) diff --git a/arch_init.c b/arch_init.c index db75120..e9d4d9e 100644 --- a/arch_init.c +++ b/arch_init.c @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque) migration_end(); } -static void reset_ram_globals(void) +static void reset_ram_globals(bool reset_bulk_stage) { last_seen_block = NULL; last_sent_block = NULL; last_offset = 0; last_version = ram_list.version; -ram_bulk_stage = true; +ram_bulk_stage = reset_bulk_stage; } #define MAX_WAIT 50 /* ms, half buffered_file limit */ @@ -674,6 +674,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque) RAMBlock *block; int64_t ram_pages = last_ram_offset() TARGET_PAGE_BITS; +/* + * RAM stays open during micro-checkpointing for the next transaction. + */ +if (migration_is_mc(migrate_get_current())) { +qemu_mutex_lock_ramlist(); +reset_ram_globals(false); +goto skip_setup; +} + migration_bitmap = bitmap_new(ram_pages); bitmap_set(migration_bitmap, 0, ram_pages); migration_dirty_pages = ram_pages; @@ -710,12 +719,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque) qemu_mutex_lock_iothread(); qemu_mutex_lock_ramlist(); bytes_transferred = 0; -reset_ram_globals(); +reset_ram_globals(true); memory_global_dirty_log_start(); migration_bitmap_sync(); qemu_mutex_unlock_iothread(); +skip_setup: + qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE); QTAILQ_FOREACH(block, ram_list.blocks, next) { @@ -744,7 +755,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) qemu_mutex_lock_ramlist(); if (ram_list.version != last_version) { -reset_ram_globals(); +reset_ram_globals(true); } ram_control_before_iterate(f, RAM_CONTROL_ROUND); @@ -825,7 +836,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque) } ram_control_after_iterate(f, RAM_CONTROL_FINISH); -migration_end(); + +/* + * Only cleanup at the end of normal migrations + * or if the MC destination failed and we got an error. + * Otherwise, we are (or will soon be) in MIG_STATE_CHECKPOINTING. + */ +if(!migrate_use_mc() || migration_has_failed(migrate_get_current())) { +migration_end(); +} qemu_mutex_unlock_ramlist(); qemu_put_be64(f, RAM_SAVE_FLAG_EOS); diff --git a/include/migration/migration.h b/include/migration/migration.h index a7c54fe..e876a2c 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -101,7 +101,9 @@ int migrate_fd_close(MigrationState *s); void add_migration_state_change_notifier(Notifier *notify); void remove_migration_state_change_notifier(Notifier *notify); +bool migration_is_active(MigrationState *); bool migration_in_setup(MigrationState *); +bool migration_is_mc(MigrationState *s); bool migration_has_finished(MigrationState *); bool migration_has_failed(MigrationState *); MigrationState *migrate_get_current(void); diff --git a/migration.c b/migration.c index 25add6f..f42dae4 100644 --- a/migration.c +++ b/migration.c @@ -36,16 +36,6 @@ do { } while (0) #endif -enum { -MIG_STATE_ERROR = -1, -MIG_STATE_NONE, -MIG_STATE_SETUP, -MIG_STATE_CANCELLING, -MIG_STATE_CANCELLED, -MIG_STATE_ACTIVE, -MIG_STATE_COMPLETED, -}; - #define MAX_THROTTLE (32 20) /* Migration speed throttling */ /* Amount of time to allocate to each chunk of bandwidth-throttled @@ -273,7 +263,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params, MigrationState *s = migrate_get_current(); MigrationCapabilityStatusList *cap; -if (s-state == MIG_STATE_ACTIVE || s-state == MIG_STATE_SETUP) { +if (migration_is_active(s)) { error_set(errp, QERR_MIGRATION_ACTIVE); return; } @@ -285,7 +275,13 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params, /* shared migration helpers */ -static void migrate_set_state(MigrationState *s, int old_state, int new_state) +bool migration_is_active(MigrationState *s) +{ +return (s-state == MIG_STATE_ACTIVE) || s-state == MIG_STATE_SETUP +|| s-state == MIG_STATE_CHECKPOINTING; +} + +void migrate_set_state(MigrationState *s, int old_state, int new_state) { if (atomic_cmpxchg(s-state, old_state, new_state) == new_state) { trace_migrate_set_state(new_state);
[Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
From: Michael R. Hines mrhi...@us.ibm.com This exposes a QMP command that allows the management software or policy to control the frequency of micro-checkpointing. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- hmp-commands.hx | 16 +++- hmp.c| 6 ++ hmp.h| 1 + qapi-schema.json | 13 + qmp-commands.hx | 23 +++ 5 files changed, 58 insertions(+), 1 deletion(-) diff --git a/hmp-commands.hx b/hmp-commands.hx index f3fc514..2066c76 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -888,7 +888,7 @@ ETEXI \n\t\t\t -b for migration without shared storage with full copy of disk\n\t\t\t -i for migration without shared storage with incremental copy of disk - (base image shared between src and destination), + (base image shared between src and destination), .mhandler.cmd = hmp_migrate, }, @@ -965,6 +965,20 @@ Set maximum tolerated downtime (in seconds) for migration. ETEXI { +.name = migrate-set-mc-delay, +.args_type = value:i, +.params = value, +.help = Set maximum delay (in milliseconds) between micro-checkpoints, +.mhandler.cmd = hmp_migrate_set_mc_delay, +}, + +STEXI +@item migrate-set-mc-delay @var{millisecond} +@findex migrate-set-mc-delay +Set maximum delay (in milliseconds) between micro-checkpoints. +ETEXI + +{ .name = migrate_set_capability, .args_type = capability:s,state:b, .params = capability state, diff --git a/hmp.c b/hmp.c index edf062e..9880bc8 100644 --- a/hmp.c +++ b/hmp.c @@ -1029,6 +1029,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict) qmp_migrate_set_downtime(value, NULL); } +void hmp_migrate_set_mc_delay(Monitor *mon, const QDict *qdict) +{ +int64_t value = qdict_get_int(qdict, value); +qmp_migrate_set_mc_delay(value, NULL); +} + void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict) { int64_t value = qdict_get_int(qdict, value); diff --git a/hmp.h b/hmp.h index ed58f0e..068b2c1 100644 --- a/hmp.h +++ b/hmp.h @@ -60,6 +60,7 @@ void hmp_drive_mirror(Monitor *mon, const QDict *qdict); void hmp_drive_backup(Monitor *mon, const QDict *qdict); void hmp_migrate_cancel(Monitor *mon, const QDict *qdict); void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict); +void hmp_migrate_set_mc_delay(Monitor *mon, const QDict *qdict); void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict); void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict); void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict); diff --git a/qapi-schema.json b/qapi-schema.json index 7306adc..98abdac 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -2160,6 +2160,19 @@ { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} } ## +# @migrate-set-mc-delay +# +# Set delay (in milliseconds) between micro checkpoints. +# +# @value: maximum delay in milliseconds +# +# Returns: nothing on success +# +# Since: 2.x +## +{ 'command': 'migrate-set-mc-delay', 'data': {'value': 'int'} } + +## # @migrate_set_speed # # Set maximum speed for migration. diff --git a/qmp-commands.hx b/qmp-commands.hx index cce6b81..d8b9c34 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -754,6 +754,29 @@ Example: EQMP { +.name = migrate-set-mc-delay, +.args_type = value:i, +.mhandler.cmd_new = qmp_marshal_input_migrate_set_mc_delay, +}, + +SQMP +migrate-set-mc-delay + + +Set maximum delay (in milliseconds) between micro-checkpoints. + +Arguments: + +- value: maximum delay (json-int) + +Example: + +- { execute: migrate-set-mc-delay, arguments: { value: 100 } } +- { return: {} } + +EQMP + +{ .name = client_migrate_info, .args_type = protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?, .params = protocol hostname port tls-port cert-subject, -- 1.8.1.2
[Qemu-devel] [RFC PATCH v2 12/12] mc: activate and use MC if requested
From: Michael R. Hines mrhi...@us.ibm.com Once the initial migration has completed, we kickoff the migration_thread, which never dies. Additionally, we register load/save functions for MC which allow us to inform the destination that we are requesting a micro-checkpointing session without needing to add additional command-line switches on the destination side. Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- include/migration/migration.h | 2 +- include/migration/qemu-file.h | 1 + migration.c | 33 - vl.c | 2 ++ 4 files changed, 32 insertions(+), 6 deletions(-) diff --git a/include/migration/migration.h b/include/migration/migration.h index f18ff5e..1695e9e 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -46,7 +46,7 @@ struct MigrationState int64_t bandwidth_limit; size_t bytes_xfer; size_t xfer_limit; -QemuThread thread; +QemuThread *thread; QEMUBH *cleanup_bh; QEMUFile *file; diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h index c50de0d..89e7c19 100644 --- a/include/migration/qemu-file.h +++ b/include/migration/qemu-file.h @@ -148,6 +148,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops); QEMUFile *qemu_fopen(const char *filename, const char *mode); QEMUFile *qemu_fdopen(int fd, const char *mode); QEMUFile *qemu_fopen_socket(int fd, const char *mode); +QEMUFile *qemu_fopen_mc(void *opaque, const char *mode); QEMUFile *qemu_popen_cmd(const char *command, const char *mode); int qemu_get_fd(QEMUFile *f); int qemu_fclose(QEMUFile *f); diff --git a/migration.c b/migration.c index 0ccbeaa..1a6fdbd 100644 --- a/migration.c +++ b/migration.c @@ -93,6 +93,9 @@ static void process_incoming_migration_co(void *opaque) int ret; ret = qemu_loadvm_state(f); +if (ret = 0) { +mc_process_incoming_checkpoints_if_requested(f); +} qemu_fclose(f); free_xbzrle_decoded_buf(); if (ret 0) { @@ -313,14 +316,17 @@ static void migrate_fd_cleanup(void *opaque) { MigrationState *s = opaque; -qemu_bh_delete(s-cleanup_bh); -s-cleanup_bh = NULL; +if(s-cleanup_bh) { +qemu_bh_delete(s-cleanup_bh); +s-cleanup_bh = NULL; +} if (s-file) { DPRINTF(closing file\n); qemu_mutex_unlock_iothread(); -qemu_thread_join(s-thread); +qemu_thread_join(s-thread); qemu_mutex_lock_iothread(); +g_free(s-thread); qemu_fclose(s-file); s-file = NULL; @@ -695,11 +701,27 @@ static void *migration_thread(void *opaque) s-downtime = end_time - start_time; runstate_set(RUN_STATE_POSTMIGRATE); } else { +if(migrate_use_mc()) { +qemu_fflush(s-file); +if (migrate_use_mc_net()) { +if (mc_enable_buffering() 0 || +mc_start_buffer() 0) { +migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_ERROR); +} +} +} + if (old_vm_running) { vm_start(); } } -qemu_bh_schedule(s-cleanup_bh); + +if (migrate_use_mc() s-state != MIG_STATE_ERROR) { +mc_init_checkpointer(s); +} else { +qemu_bh_schedule(s-cleanup_bh); +} + qemu_mutex_unlock_iothread(); return NULL; @@ -720,6 +742,7 @@ void migrate_fd_connect(MigrationState *s) /* Notify before starting migration thread */ notifier_list_notify(migration_state_notifiers, s); -qemu_thread_create(s-thread, migration_thread, s, +s-thread = g_malloc0(sizeof(*s-thread)); +qemu_thread_create(s-thread, migration_thread, s, QEMU_THREAD_JOINABLE); } diff --git a/vl.c b/vl.c index 2fb5b1f..093eb20 100644 --- a/vl.c +++ b/vl.c @@ -4145,6 +4145,8 @@ int main(int argc, char **argv, char **envp) default_drive(default_sdcard, snapshot, IF_SD, 0, SD_OPTS); register_savevm_live(NULL, ram, 0, 4, savevm_ram_handlers, NULL); +register_savevm(NULL, mc, -1, MC_VERSION, mc_info_save, +mc_info_load, NULL); if (nb_numa_nodes 0) { int i; -- 1.8.1.2
Re: [Qemu-devel] [PATCH 2/3] qtest: make QEMU our direct child process
On Mon, Feb 17, 2014 at 05:44:55PM +0100, Markus Armbruster wrote: Stefan Hajnoczi stefa...@redhat.com writes: qtest_init() cannot use exec*p() to launch QEMU since the exec*p() functions take an argument array while qtest_init() takes char *extra_args. Therefore we execute /bin/sh -c command-line and let the shell parse the argument string. This left /bin/sh as our child process and our child's child was QEMU. We still want QEMU's pid so the -pidfile option was used to let QEMU report its pid. The pidfile needs to be unlinked when the test case exits or fails. In other words, the pidfile creates a new problem for us! Simplify all this using the shell 'exec' command. It allows us to replace the /bin/sh process with QEMU. Then we no longer need to use -pidfile because we already know our fork child's pid. Note: Yes, it seems silly to exec /bin/sh when we could just exec QEMU directly. But remember qtest_init() takes a single char *extra_args command-line fragment instead of a real argv[] array, so we need /bin/sh's argument parsing behavior. Sounds like a design mistake to me. I wouldn't call char *extra_args a mistake because strings are still simpler to manipulate in C than char *argv[] arrays. So we write less code in test cases and libqtest.c at the expense of a roundabout way to spawn the process. Stefan
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
On Mon, Feb 17, 2014 at 06:00:03PM +0100, Markus Armbruster wrote: Paolo Bonzini pbonz...@redhat.com writes: Il 17/02/2014 16:44, Stefan Hajnoczi ha scritto: } +static void sigabrt_handler(int signo) +{ +qtest_end(); +} + void qtest_quit(QTestState *s) { int status; if (s-qemu_pid != -1) { kill(s-qemu_pid, SIGTERM); waitpid(s-qemu_pid, status, 0); } close(s-fd); close(s-qmp_fd); g_string_free(s-rx, true); g_free(s); } Not async-signal safe. You need to ignore the g_string_free and g_free (perhaps even the closes) if calling from the sigabrt_handler. kill(), waitpid() and close() are all async-signal-safe. SIGABRT is normally synchronous enough: it's sent by abort(). But of course, nothing stops the user from kill -ABRT. Or GLib from calling abort() in some place where an attempt to reenter it crashes burns. Not sure I'd care, but I'm pretty sure I don't care for freeing stuff on exit :) Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. In practice there is no issue and I've tested assertion failure with glib 1.2.10. Stefan
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
On Mon, Feb 17, 2014 at 05:49:31PM +0100, Markus Armbruster wrote: Stefan Hajnoczi stefa...@redhat.com writes: +/* Catch SIGABRT to clean up on g_assert() failure */ +sigact = (struct sigaction){ +.sa_handler = sigabrt_handler, +.sa_flags = SA_RESETHAND, +}; Assumes zero-initialization has the same effect as sigemptyset(sigact.sa_mask). Quoting POSIX: The implementation of the sigemptyset() (or sigfillset()) function could quite trivially clear (or set) all the bits in the signal set. Alternatively, it would be reasonable to initialize part of the structure, such as a version field, to permit binary-compatibility between releases where the size of the set varies. For such reasons, either sigemptyset() or sigfillset() must be called prior to any other use of the signal set, even if such use is read-only (for example, as an argument to sigpending()). Looks like you better sigemptyset() here, for maximum portability. Okay, will do that. +sigaction(SIGABRT, sigact, s-sigact_old); socket_path = g_strdup_printf(/tmp/qtest-%d.sock, getpid()); qmp_socket_path = g_strdup_printf(/tmp/qtest-%d.qmp, getpid()); @@ -150,13 +167,19 @@ void qtest_quit(QTestState *s) { int status; +sigaction(SIGABRT, s-sigact_old, NULL); + Can this ever restore the action to anything but the default action? This code is in a library so I don't want to make assumptions about the callers. Keeping sigact_old around is literally 1 line of code so I think it's worth it. if (s-qemu_pid != -1) { kill(s-qemu_pid, SIGTERM); waitpid(s-qemu_pid, status, 0); } -close(s-fd); -close(s-qmp_fd); +if (s-fd != -1) { +close(s-fd); +} +if (s-qmp_fd != -1) { +close(s-qmp_fd); +} I generally don't bother to avoid close(-1). When I drive on the highway I stay on the lanes but I guess I could just slide along the side barriers :). It's a style issue but close(-1) annoys me in strace so I try to avoid doing it.
Re: [Qemu-devel] [PATCH v3 0/4] X86/KVM: enable Intel MPX for KVM
Il 22/01/2014 13:03, Paolo Bonzini ha scritto: Il 22/01/2014 06:29, Liu, Jinsong ha scritto: These patches are version 3 to enalbe Intel MPX for KVM. Version 1: * Add some Intel MPX definiation * Fix a cpuid(0x0d, 0) exposing bug, dynamic per XCR0 features enable/disable * vmx and msr handle for MPX support at KVM * enalbe MPX feature for guest Version 2: * remove generic MPX definiation, Qiaowei's patch has add the definiation at kernel side * add MSR_IA32_BNDCFGS to msrs_to_save Version 3: * rebase on latest kernel, which include Qiaowei's MPX common definiation pulled from HPA's tree I am afraid there is still some work to do on these patches, so they need to be delayed to 3.15. Patch 1: this seems mostly separate from the rest of the MPX work. I commented on the missing ULL suffix, but I would also like to understand why you put this patch in this series. Patch 2: As remarked in the reply to this patch: - the vmx_disable_intercept_for_msr has to be unconditional - you need a new kvm_x86_ops member mpx_supported, to disable MPX whenever the two VMX controls are not available. Patch 3: this patch needs to be rebased. Apart from that it is fine, but please move the VMX bits together with patch 2, and the other bits together with patch 4. Patch 4: this patch needs to be rebased and to use the new mpx_supported member If you also want to look at nested VMX support for MPX, that would be nice. It should not be hard. Otherwise we can take care of that later. Thanks for your work, Paolo Are you going to send v4? Paolo
Re: [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing
Hi, Michael this patch-set will break normal build(without --enable-mc): migration.c: In function ‘migrate_rdma_pin_all’: migration.c:564: error: ‘MIGRATION_CAPABILITY_X_RDMA_PIN_ALL’ undeclared (first use in this function) migration.c:564: error: for each function it appears in.) Thanks! Li Guang mrhi...@linux.vnet.ibm.com wrote: From: Michael R. Hinesmrhi...@us.ibm.com Changes since v1: 1. Re-based against Juan's improved migration_bitmap performance changes 2. Overhauled RDMA support to prepare for better usage of RDMA in other parts of the QEMU code base (such as storage). 3. Fix for netlink issues that failed to cleanup the network buffer device for development testing. Michael R. Hines (12): mc: add documentation for micro-checkpointing mc: timestamp migration_bitmap and KVM logdirty usage mc: introduce a 'checkpointing' status check into the VCPU states mc: support custom page loading and copying rdma: accelerated memcpy() support and better external RDMA user interfaces mc: introduce state machine changes for MC mc: introduce additional QMP statistics for micro-checkpointing mc: core logic mc: configure and makefile support mc: expose tunable parameter for checkpointing frequency mc: introduce new capabilities to control micro-checkpointing mc: activate and use MC if requested Makefile.objs |1 + arch_init.c | 72 +- configure | 45 + cpus.c|9 +- docs/mc.txt | 222 hmp-commands.hx | 16 +- hmp.c | 23 + hmp.h |1 + include/migration/migration.h | 70 +- include/migration/qemu-file.h | 55 +- migration-checkpoint.c| 1565 + migration-rdma.c | 2605 +++-- migration.c | 156 ++- qapi-schema.json | 86 +- qemu-file.c | 80 +- qmp-commands.hx | 23 + vl.c |9 + 17 files changed, 4097 insertions(+), 941 deletions(-) create mode 100644 docs/mc.txt create mode 100644 migration-checkpoint.c
Re: [Qemu-devel] [PATCH v3 0/4] X86/KVM: enable Intel MPX for KVM
Paolo Bonzini wrote: Il 22/01/2014 13:03, Paolo Bonzini ha scritto: Il 22/01/2014 06:29, Liu, Jinsong ha scritto: These patches are version 3 to enalbe Intel MPX for KVM. Version 1: * Add some Intel MPX definiation * Fix a cpuid(0x0d, 0) exposing bug, dynamic per XCR0 features enable/disable * vmx and msr handle for MPX support at KVM * enalbe MPX feature for guest Version 2: * remove generic MPX definiation, Qiaowei's patch has add the definiation at kernel side * add MSR_IA32_BNDCFGS to msrs_to_save Version 3: * rebase on latest kernel, which include Qiaowei's MPX common definiation pulled from HPA's tree I am afraid there is still some work to do on these patches, so they need to be delayed to 3.15. Patch 1: this seems mostly separate from the rest of the MPX work. I commented on the missing ULL suffix, but I would also like to understand why you put this patch in this series. Patch 2: As remarked in the reply to this patch: - the vmx_disable_intercept_for_msr has to be unconditional - you need a new kvm_x86_ops member mpx_supported, to disable MPX whenever the two VMX controls are not available. Patch 3: this patch needs to be rebased. Apart from that it is fine, but please move the VMX bits together with patch 2, and the other bits together with patch 4. Patch 4: this patch needs to be rebased and to use the new mpx_supported member If you also want to look at nested VMX support for MPX, that would be nice. It should not be hard. Otherwise we can take care of that later. Thanks for your work, Paolo Are you going to send v4? Paolo Yes, I just return from long Chinese Spring Festival, I will send V4 later. Thanks, Jinsong
Re: [Qemu-devel] [PATCH v2] RFC: Add blockdev-del QMP command
Kevin Wolf kw...@redhat.com writes: Am 12.02.2014 um 18:36 hat Ian Main geschrieben: This is the sister command to blockdev-add. In Fam's example he uses the drive_del HMP command to clean up but it would be much nicer to have a way to do this via QMP. Signed-off-by: Ian Main im...@redhat.com diff --git a/qapi-schema.json b/qapi-schema.json index d22651c..01186cd 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -4469,3 +4469,14 @@ # Since: 1.7 ## { 'command': 'blockdev-add', 'data': { 'options': 'BlockdevOptions' } } + +## +# @blockdev-del: +# +# Delete a block device. +# +# @device: Identifier for the block device to be deleted. +# +# Since: 2.0 +## +{ 'command': 'blockdev-del', 'data': { 'device': 'str' } } I believe the full documentation should go here as well, not just in qmp-commands.hx. Yes. diff --git a/qmp-commands.hx b/qmp-commands.hx index c3ee46a..f08045d 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -3442,6 +3442,36 @@ Example (2): EQMP { +.name = blockdev-del, +.args_type = device:s, +.mhandler.cmd_new = qmp_marshal_input_blockdev_del, +}, + +SQMP +blockdev-del + + +Remove host block device. The result is that guest generated IO is no +longer submitted against the host device underlying the disk. Once a +drive has been deleted, the QEMU Block layer returns -EIO which results +in IO errors in the guest for applications that are reading/writing to +the device. These errors are always reported to the guest, regardless +of the drive's error actions (drive options rerror, werror). I think we wanted to have different semantics for blockdev-del. Specifically, it is a backend command that should have no effect on users of that backend. Let me paste and comment on some notes I made in a previous blockdev discussion: * Make sure that an explicit blockdev-del is needed to remove the BDS; it shouldn't happen automagically just because the guest device was unplugged [done] * By default, return an error for blockdev-del if reference count 1 Yes. Deleting a block device that is not in use is a perfectly innocent operation. Deleting one that is in use isn't; it can corrupt data. Innocent and dangerous operations should be separate. A force option is an acceptable separator. Having a separate command would be another. [ The assumption is here that one reference is held by the monitor/external user, which isn't true today to my knowledge ] Here's my working hypothesis on this matter: blockdev_init() initializes the reference count to 1. This is for the reference it puts into the list of drives. It also returns a weak reference. We pair blockdev_init() with a drive_put_ref(). drive_put_ref() knows it must be paired with blockdev_init() when the reference count is 1. Then it deletes from drives in addition to decrementing the count. Actually, nothing else ever uses the reference count since commit fa510eb switch block jobs over to the BDS reference count. Worth a cleanup. - But have a force option that closes the image file, even if it breaks the remaining users (e.g. uncooperative guest that doesn't release its PCI device) [ Here we need working refcounting including the external user, because otherwise we don't free the (closed) BDS even when the guest device is unplugged. It is an open question whether and how BDSes without an external reference are shown in the monitor. ] Despite its name, the purpose of drive_del is not really deleting a drive, it's revoking access to the host resources backing a drive. drive_del doesn't delete the drive, it merely detaches the BlockDriverState from its host resources, and schedules it for automatic deletion. That way, references beyond drive_del's reach remain valid. Special case: if it's not in use by a front end, delete it right away. Assumes that references beyond drive_del's reach exist only while it is in use by a frone end. I think there are basic operations struggling to get out of this mess: * Create block device (and put it into drives) * Delete block device (and remove it from drives) Fails if the block device is in use * Revoke access to host resources * Start use of block device * Stop use of block device * Prevent mixing blockdev-add with drive_del and vice versa - Ideally drive_add BDSes are exactly those with a DriveInfo [ not true today, but DriveInfo.enable_auto_del can be used to distinguish them ] So I believe we still have some design work to do before we can actually implement this. I would prefer not to merge this for 2.0. I'm afraid you're right.
Re: [Qemu-devel] [PATCH v3] Fix QEMU build on OpenBSD on x86 archs
On 18 February 2014 00:36, Brad Smith b...@comstyle.com wrote: Just one issue... just curious why my last name is missing from the commit? Ah, we've run into a bug in patchwork. I downloaded your patch: http://patchwork.ozlabs.org/patch/299656/ but patchwork reconstructs patch mboxes from its database rather than just returning the original email, and it has a bug where it will always remember the name used by the very first email it ever saw from you. I guess it has a patch in the system way back from you with a misconfigured fullname. Sorry about that. I know they've discussed this issue on the patchwork mailing list but they obviously haven't implemented a fix yet. thanks -- PMM
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 18 February 2014 01:02, Andreas Färber afaer...@suse.de wrote: Shouldn't 1/3 and 3/3 be included in 1.7.1 as guest-triggerable bugs? There's no qemu-sta...@nongnu.org in the commit message. That's up to whoever maintains stable, I guess. I have no objection. thanks -- PMM
Re: [Qemu-devel] Cortex-M3: reading NVIC registers causes segfaults
On 18 February 2014 01:14, Andreas Färber afaer...@suse.de wrote: While first_cpu may help Andreas in his local copy for STM32, that assumption is not okay in general. The Vybrid VF6 has both a GIC and an NVIC, so our NVIC code should not make assumptions which CPU it can access. Do you mean it has an M core CPU with an NVIC plus some other A-class CPU with a GIC. Certainly in a system with heterogenous CPUs this change won't work, but we don't support that at all currently and there's a lot of work between here and there. I assume we would shield both using CPU address spaces The GIC needs to know which CPU it's being accessed by. I guess in theory you could have it export N distributor memory regions which were identical apart from making I assume I'm being accessed by CPU X assumptions, but that seems like a bit of a kludge. Ideally we should have some way for the target of a memory transaction to get at information about the source (ie all the out of band info that hardware sends like master ID, privilege level, secure vs nonsecure, etc). but I wonder if either of you two has already thought about how those will interact for gdbstub? I'd need to look at the protocol to figure out if gdbstub does memory accesses in the context of a particular thread (CPU for us) or not. If not, that could be awkward. thanks -- PMM
Re: [Qemu-devel] [PATCH 2/3] qtest: make QEMU our direct child process
Stefan Hajnoczi stefa...@gmail.com writes: On Mon, Feb 17, 2014 at 05:44:55PM +0100, Markus Armbruster wrote: Stefan Hajnoczi stefa...@redhat.com writes: qtest_init() cannot use exec*p() to launch QEMU since the exec*p() functions take an argument array while qtest_init() takes char *extra_args. Therefore we execute /bin/sh -c command-line and let the shell parse the argument string. This left /bin/sh as our child process and our child's child was QEMU. We still want QEMU's pid so the -pidfile option was used to let QEMU report its pid. The pidfile needs to be unlinked when the test case exits or fails. In other words, the pidfile creates a new problem for us! Simplify all this using the shell 'exec' command. It allows us to replace the /bin/sh process with QEMU. Then we no longer need to use -pidfile because we already know our fork child's pid. Note: Yes, it seems silly to exec /bin/sh when we could just exec QEMU directly. But remember qtest_init() takes a single char *extra_args command-line fragment instead of a real argv[] array, so we need /bin/sh's argument parsing behavior. Sounds like a design mistake to me. I wouldn't call char *extra_args a mistake because strings are still simpler to manipulate in C than char *argv[] arrays. So we write less code in test cases and libqtest.c at the expense of a roundabout way to spawn the process. Building command arguments in a string is deceptively simple. The deception collapses once you add funny characters that make the shell do funny and unexpected things. Our extra_args string manipulations are typically of the form format a bunch of options and option arguments into a string. I suspect using a function to collect a bunch of options and option arguments into an array would be about the same amount of code in most cases. Just sayin'; I'm not objecting to your patch.
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Stefan Hajnoczi stefa...@gmail.com writes: On Mon, Feb 17, 2014 at 05:49:31PM +0100, Markus Armbruster wrote: Stefan Hajnoczi stefa...@redhat.com writes: +/* Catch SIGABRT to clean up on g_assert() failure */ +sigact = (struct sigaction){ +.sa_handler = sigabrt_handler, +.sa_flags = SA_RESETHAND, +}; Assumes zero-initialization has the same effect as sigemptyset(sigact.sa_mask). Quoting POSIX: The implementation of the sigemptyset() (or sigfillset()) function could quite trivially clear (or set) all the bits in the signal set. Alternatively, it would be reasonable to initialize part of the structure, such as a version field, to permit binary-compatibility between releases where the size of the set varies. For such reasons, either sigemptyset() or sigfillset() must be called prior to any other use of the signal set, even if such use is read-only (for example, as an argument to sigpending()). Looks like you better sigemptyset() here, for maximum portability. Okay, will do that. +sigaction(SIGABRT, sigact, s-sigact_old); socket_path = g_strdup_printf(/tmp/qtest-%d.sock, getpid()); qmp_socket_path = g_strdup_printf(/tmp/qtest-%d.qmp, getpid()); @@ -150,13 +167,19 @@ void qtest_quit(QTestState *s) { int status; +sigaction(SIGABRT, s-sigact_old, NULL); + Can this ever restore the action to anything but the default action? This code is in a library so I don't want to make assumptions about the callers. Keeping sigact_old around is literally 1 line of code so I think it's worth it. if (s-qemu_pid != -1) { kill(s-qemu_pid, SIGTERM); waitpid(s-qemu_pid, status, 0); } -close(s-fd); -close(s-qmp_fd); +if (s-fd != -1) { +close(s-fd); +} +if (s-qmp_fd != -1) { +close(s-qmp_fd); +} I generally don't bother to avoid close(-1). When I drive on the highway I stay on the lanes but I guess I could just slide along the side barriers :). It's a style issue but close(-1) annoys me in strace so I try to avoid doing it. For me, it's in the same category as free(NULL).
Re: [Qemu-devel] [PATCH] virtio-net: only output the vlan table when VIRTIO_NET_F_CTRL_VLAN is negotiated
On Mon, Feb 17, 2014 at 11:58:11AM -0500, Vlad Yasevich wrote: On 02/17/2014 11:56 AM, Eric Blake wrote: On 02/17/2014 09:52 AM, Eric Blake wrote: On 02/16/2014 07:27 PM, Amos Kong wrote: Stefan Fritsch just fixed a virtio-net driver bug [1], virtio-net won't filter out VLAN-tagged packets if VIRTIO_NET_F_CTRL_VLAN isn't negotiated. We should also not send the vlan table to management, this patch makes the vlan-talbe optional. s/talbe/table/ @@ -4053,7 +4053,7 @@ 'multicast-overflow': 'bool', 'unicast-overflow': 'bool', 'main-mac': 'str', -'vlan-table': ['int'], +'*vlan-table': ['int'], Indentation is now off. - main-mac: main macaddr string (json-string) -- vlan-table: a json-array of active vlan id +- vlan-table: a json-array of active vlan id (optoinal) s/optoinal/optional/ Those fixes are trivial enough, so I'm okay if you correct them then add: Scratch that. I revoke my R-b, on the following grounds: On 02/17/2014 08:27 AM, Vlad Yasevich wrote: On 02/16/2014 09:27 PM, Amos Kong wrote: Stefan Fritsch just fixed a virtio-net driver bug [1], virtio-net won't filter out VLAN-tagged packets if VIRTIO_NET_F_CTRL_VLAN isn't negotiated. We should also not send the vlan table to management, this patch makes the vlan-talbe optional. ^^ table. One question I have from the API perspective is can we suddenly change something to be optional? If there are any users of this , wouldn't they have to change now to check the existence of this list? You are correct. Since the parameter is an output field, older clients may be depending on it existing. It is okay to generate an empty array, but you must not entirely omit the array unless you add further justification in your commit message that you are 100% positive that there are no clients of 1.6 that will be broken when the array no longer appears in the output. Can you rework the patch to just leave the array empty in the case where the bit does not indicate it is used? Or do we need to add a new bool field to the output for new enough management to know whether to use the array? I think a completely empty array should be sufficient. -vlad An empty array could mean either no vlans allowed or all vlans allowed. Also it's nice if users can detect an old buggy qemu. Let's just add an rx state.
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Markus Armbruster arm...@redhat.com writes: Stefan Hajnoczi stefa...@redhat.com writes: The QEMU process stays running if the test case fails. This patch fixes the leak by installing a SIGABRT signal handler which invokes qtest_end(). In order to make that work for assertion failures during qtest_init(), we need to initialize QTestState fields including file descriptors and pids carefully. qtest_quit() is then safe to call even during qtest_init(). Signed-off-by: Stefan Hajnoczi stefa...@redhat.com --- tests/libqtest.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/tests/libqtest.c b/tests/libqtest.c index 8b2b2d7..09a0481 100644 --- a/tests/libqtest.c +++ b/tests/libqtest.c @@ -44,6 +44,7 @@ struct QTestState bool irq_level[MAX_IRQ]; GString *rx; pid_t qemu_pid; /* our child QEMU process */ +struct sigaction sigact_old; /* restored on exit */ }; #define g_assert_no_errno(ret) do { \ @@ -88,6 +89,11 @@ static int socket_accept(int sock) return ret; } +static void sigabrt_handler(int signo) +{ +qtest_end(); Don't you have to re-raise SIGABRT here, to actually terminate the process? +} + QTestState *qtest_init(const char *extra_args) { QTestState *s; [...]
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Stefan Hajnoczi stefa...@gmail.com writes: On Mon, Feb 17, 2014 at 06:00:03PM +0100, Markus Armbruster wrote: Paolo Bonzini pbonz...@redhat.com writes: Il 17/02/2014 16:44, Stefan Hajnoczi ha scritto: } +static void sigabrt_handler(int signo) +{ +qtest_end(); +} + void qtest_quit(QTestState *s) { int status; if (s-qemu_pid != -1) { kill(s-qemu_pid, SIGTERM); waitpid(s-qemu_pid, status, 0); } close(s-fd); close(s-qmp_fd); g_string_free(s-rx, true); g_free(s); } Not async-signal safe. You need to ignore the g_string_free and g_free (perhaps even the closes) if calling from the sigabrt_handler. kill(), waitpid() and close() are all async-signal-safe. SIGABRT is normally synchronous enough: it's sent by abort(). But of course, nothing stops the user from kill -ABRT. Or GLib from calling abort() in some place where an attempt to reenter it crashes burns. Not sure I'd care, but I'm pretty sure I don't care for freeing stuff on exit :) Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. The (theoretical!) problem is when it aborts in the bowels of g_*free(), and your SIGABRT handler reenters g_*free(). In practice there is no issue and I've tested assertion failure with glib 1.2.10. Worst that can happen is we crash on the way from abort() to process termination. Tolerable. Still, avoiding unnecessary cleanup on that path seems prudent to me. If you agree, factor out the kill()/waitpid(), and call only that from the signal handler.
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Il 18/02/2014 10:05, Stefan Hajnoczi ha scritto: SIGABRT is normally synchronous enough: it's sent by abort(). But of course, nothing stops the user from kill -ABRT. Or GLib from calling abort() in some place where an attempt to reenter it crashes burns. Not sure I'd care, but I'm pretty sure I don't care for freeing stuff on exit :) Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. If malloc aborts due to a double free or other similar problem, you may risk reentering it. Paolo
[Qemu-devel] [PATCH v2 1/2] include/qemu/crc32c.h: Rename include guards to match filename
Signed-off-by: Will Newton will.new...@linaro.org Reviewed-by: Peter Maydell peter.mayd...@linaro.org --- include/qemu/crc32c.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/qemu/crc32c.h b/include/qemu/crc32c.h index 56d1c3b..dafb6a1 100644 --- a/include/qemu/crc32c.h +++ b/include/qemu/crc32c.h @@ -25,8 +25,8 @@ * */ -#ifndef QEMU_CRC32_H -#define QEMU_CRC32_H +#ifndef QEMU_CRC32C_H +#define QEMU_CRC32C_H #include qemu-common.h -- 1.8.1.4
[Qemu-devel] [PATCH v2 0/2] target-arm: Add support for AArch32 ARMv8 CRC32 instructions
This series adds support for the AArch32 CRC32 instructions added in ARMv8. Will Newton (2): include/qemu/crc32c.h: Rename include guards to match filename target-arm: Add support for AArch32 ARMv8 CRC32 instructions configure | 2 +- include/qemu/crc32c.h | 4 ++-- target-arm/helper.c| 39 +++ target-arm/helper.h| 3 +++ target-arm/translate.c | 48 5 files changed, 93 insertions(+), 3 deletions(-) -- 1.8.1.4
[Qemu-devel] [PATCH v2 2/2] target-arm: Add support for AArch32 ARMv8 CRC32 instructions
Add support for AArch32 CRC32 and CRC32C instructions added in ARMv8. The CRC32-C implementation used is the built-in qemu implementation and The CRC-32 implementation is from zlib. This requires adding zlib to LIBS to ensure it is linked for the linux-user binary. Signed-off-by: Will Newton will.new...@linaro.org --- configure | 2 +- target-arm/helper.c| 39 +++ target-arm/helper.h| 3 +++ target-arm/translate.c | 48 4 files changed, 91 insertions(+), 1 deletion(-) diff --git a/configure b/configure index 4648117..822842c 100755 --- a/configure +++ b/configure @@ -1550,7 +1550,7 @@ EOF Make sure to have the zlib libs and headers installed. fi fi -libs_softmmu=$libs_softmmu -lz +LIBS=$LIBS -lz ## # libseccomp check diff --git a/target-arm/helper.c b/target-arm/helper.c index 5ae08c9..294cfaf 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -5,6 +5,8 @@ #include sysemu/arch_init.h #include sysemu/sysemu.h #include qemu/bitops.h +#include qemu/crc32c.h +#include zlib.h /* For crc32 */ #ifndef CONFIG_USER_ONLY static inline int get_phys_addr(CPUARMState *env, uint32_t address, @@ -4468,3 +4470,40 @@ int arm_rmode_to_sf(int rmode) } return rmode; } + +static void crc_init_buffer(uint8_t *buf, uint32_t val, uint32_t bytes) +{ +memset(buf, 0, 4); + +if (bytes == 1) { +buf[0] = val 0xff; +} else if (bytes == 2) { +buf[0] = val 0xff; +buf[1] = (val 8) 0xff; +} else { +buf[0] = val 0xff; +buf[1] = (val 8) 0xff; +buf[2] = (val 16) 0xff; +buf[3] = (val 24) 0xff; +} +} + +uint32_t HELPER(crc32)(uint32_t acc, uint32_t val, uint32_t bytes) +{ +uint8_t buf[4]; + +crc_init_buffer(buf, val, bytes); + +/* zlib crc32 converts the accumulator and output to one's complement. */ +return crc32(acc ^ 0x, buf, bytes) ^ 0x; +} + +uint32_t HELPER(crc32c)(uint32_t acc, uint32_t val, uint32_t bytes) +{ +uint8_t buf[4]; + +crc_init_buffer(buf, val, bytes); + +/* Linux crc32c converts the output to one's complement. */ +return crc32c(acc, buf, bytes) ^ 0x; +} diff --git a/target-arm/helper.h b/target-arm/helper.h index 951e6ad..fb92e53 100644 --- a/target-arm/helper.h +++ b/target-arm/helper.h @@ -494,6 +494,9 @@ DEF_HELPER_3(neon_qzip32, void, env, i32, i32) DEF_HELPER_4(crypto_aese, void, env, i32, i32, i32) DEF_HELPER_4(crypto_aesmc, void, env, i32, i32, i32) +DEF_HELPER_3(crc32, i32, i32, i32, i32) +DEF_HELPER_3(crc32c, i32, i32, i32, i32) + #ifdef TARGET_AARCH64 #include helper-a64.h #endif diff --git a/target-arm/translate.c b/target-arm/translate.c index 782aab8..9410b6a 100644 --- a/target-arm/translate.c +++ b/target-arm/translate.c @@ -7541,6 +7541,32 @@ static void disas_arm_insn(CPUARMState * env, DisasContext *s) store_reg(s, 14, tmp2); gen_bx(s, tmp); break; +case 0x4: +{ +/* crc32/crc32c */ +uint32_t c = extract32(insn, 9, 1); + +/* size == 64 is UNPREDICTABLE but handle as UNDEFINED. */ +if (op1 == 0x3) { +goto illegal_op; +} + +rn = (insn 16) 0xf; +rd = (insn 12) 0xf; + +tmp = load_reg(s, rn); +tmp2 = load_reg(s, rm); +tmp3 = tcg_const_i32(1 op1); +if (c) { +gen_helper_crc32c(tmp, tmp, tmp2, tmp3); +} else { +gen_helper_crc32(tmp, tmp, tmp2, tmp3); +} +tcg_temp_free_i32(tmp2); +tcg_temp_free_i32(tmp3); +store_reg(s, rd, tmp); +break; +} case 0x5: /* saturating add/subtract */ ARCH(5TE); rd = (insn 12) 0xf; @@ -9125,6 +9151,28 @@ static int disas_thumb2_insn(CPUARMState *env, DisasContext *s, uint16_t insn_hw case 0x18: /* clz */ gen_helper_clz(tmp, tmp); break; +case 0x20: +case 0x21: +case 0x22: +case 0x28: +case 0x29: +case 0x2a: +{ +/* crc32/crc32c */ +uint32_t sz = op 0x3; +uint32_t c = op 0x8; + +tmp2 = load_reg(s, rm); +tmp3 = tcg_const_i32(1 sz); +if (c) { +gen_helper_crc32c(tmp, tmp, tmp2, tmp3); +} else { +gen_helper_crc32(tmp, tmp, tmp2, tmp3); +} +tcg_temp_free_i32(tmp2); +tcg_temp_free_i32(tmp3); +break; +} default:
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
Am 18.02.2014 10:46, schrieb Peter Maydell: On 18 February 2014 01:02, Andreas Färber afaer...@suse.de wrote: Shouldn't 1/3 and 3/3 be included in 1.7.1 as guest-triggerable bugs? There's no qemu-sta...@nongnu.org in the commit message. That's up to whoever maintains stable, I guess. I have no objection. No, we've had that topic before: It's your job as submitter and maintainer to flag that appropriately in the commit message, as per QEMU Summit 2012. Thanks, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
On Tue, Feb 18, 2014 at 11:07:53AM +0100, Paolo Bonzini wrote: Il 18/02/2014 10:05, Stefan Hajnoczi ha scritto: SIGABRT is normally synchronous enough: it's sent by abort(). But of course, nothing stops the user from kill -ABRT. Or GLib from calling abort() in some place where an attempt to reenter it crashes burns. Not sure I'd care, but I'm pretty sure I don't care for freeing stuff on exit :) Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. If malloc aborts due to a double free or other similar problem, you may risk reentering it. If you register the custom SIGABRT handler with sigaction + SA_RESETHAND then you'd avoid the re-entrancy risk, since a cascading SIGABRT would get handled by the system default handler, which would immediately terminate the process. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] [PATCH v2, Ping] SMBIOS: Upgrade Type17 to v2.3, add Type2
On Mo, 2014-02-17 at 15:33 -0500, Kevin O'Connor wrote: On Mon, Feb 17, 2014 at 11:09:48AM -0500, Gabriel L. Somlo wrote: On Fri, Feb 07, 2014 at 04:37:58PM +0100, Paolo Bonzini wrote: Il 06/02/2014 14:38, Gabriel L. Somlo ha scritto: On Wed, Feb 05, 2014 at 08:02:25PM -0500, Kevin O'Connor wrote: Thanks. In general, though, it is preferred to make smbios changes at the QEMU layer. Indeed, going forward, I'd like to see all the smbios stuff moved up to QEMU. Is there something in these two patches that can't be done in QEMU? But anyhow, right now QEMU seems to be making relatively minor tweaks to something that's firmly at home in SeaBIOS, which is why I sent my patches to the latter... Yeah, that's correct. There's really no particular hook in QEMU to do this. Would it be OK to apply this in SeaBIOS now, so that 1. it can be useful until whenever SMBIOS gets transferred/absorbed into QEMU, and 2. it won't fall through the cracks when that transition happens ? Unfortunately, if we change the smbios in SeaBIOS, it will show up on all machine types that use the new version of SeaBIOS. We've had issues with this type of change before as various OSes react differently to the change. Gerd, what's your take on this change? I think we can do this in qemu, without touching seabios at all. The currently used fw_cfg interface allows to pass both individual fields and complete tables. In practice the individual fields interface is used for type0 and type1 table fields. I think the only way to pass complete tables is 'qemu -smbios file=pregenerated-blob-here'. If seabios finds a table provided by qemu it used it, otherwise it (possibly) generates its own. So we can smoothly switch over to qemu, table-by-table. You can have qemu provide type2+type17 tables, and leave everything else as-is. And when doing it in qemu it is easy to do it for new machine types only. cheers, Gerd
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Il 18/02/2014 11:17, Daniel P. Berrange ha scritto: Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. If malloc aborts due to a double free or other similar problem, you may risk reentering it. If you register the custom SIGABRT handler with sigaction + SA_RESETHAND then you'd avoid the re-entrancy risk, since a cascading SIGABRT would get handled by the system default handler, which would immediately terminate the process. I meant reentering malloc. Paolo
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Il 18/02/2014 11:05, Markus Armbruster ha scritto: Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. The (theoretical!) problem is when it aborts in the bowels of g_*free(), and your SIGABRT handler reenters g_*free(). In practice there is no issue and I've tested assertion failure with glib 1.2.10. Worst that can happen is we crash on the way from abort() to process termination. Tolerable. What about recursive locking of a non-recursive mutex? Paolo
Re: [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
* mrhi...@linux.vnet.ibm.com (mrhi...@linux.vnet.ibm.com) wrote: From: Michael R. Hines mrhi...@us.ibm.com We also later export these statistics over QMP for better monitoring of micro-checkpointing as the workload changes. snip @@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage) /* XBZRLE overflow or normal page */ if (bytes_sent == -1) { bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE); -qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE); -bytes_sent += TARGET_PAGE_SIZE; -acct_info.norm_pages++; +if (ret != RAM_SAVE_CONTROL_DELAYED) { +qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE); +bytes_sent += TARGET_PAGE_SIZE; +acct_info.norm_pages++; +} } Is that last change intended for this patch; it doesn't look timestamp related. Dave -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
On Tue, Feb 18, 2014 at 02:38:40AM +, Zhanghaoyu (A) wrote: Hi, all The VM will get stuck for a while(about 6s for a VM with 20GB memory) when attaching a pass-through PCI card to the non-pass-through VM for the first time. The reason is that the host will build the whole VT-d GPA-HPA DMAR page-table, which needs a lot of time, and during this time, the qemu_global_mutex lock is hold by the main-thread, if the vcpu thread IOCTL return, it will be blocked to waiting main-thread to release the qemu_global_mutex lock, so the VM got stuck. The race between qemu-main-thread and vcpu-thread is shown as below, QEMU-main-threadvcpu-thread | | qemu_mutex_lock_iothread qemu_mutex_lock(qemu_global_mutex) | | +loop- -+ +loop+ || | | | qemu_mutex_unlock_iothread| qemu_mutex_unlock_iothread || | | | poll | kvm_vcpu_ioctl(KVM_RUN) || | | | qemu_mutex_lock_iothread | | || | | -- || | qemu_mutex_lock_iothread | kvm_device_pci_assign| | || | blocked to waiting main-thread to release the qemu lock | about 6 sec for 20GB memory | | || | | ++ +-+ Any advises? Thanks, Zhang Haoyu What if you detach and re-attach? Is it fast then? If yes this means the issue is COW breaking that occurs with get_user_pages, not translation as such. Try hugepages with prealloc - does it help? -- MST
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
Il 18/02/2014 11:38, Michael S. Tsirkin ha scritto: What if you detach and re-attach? Is it fast then? If yes this means the issue is COW breaking that occurs with get_user_pages, not translation as such. Try hugepages with prealloc - does it help? I agree it's either COW breaking or (similarly) locking pages that the guest hasn't touched yet. You can use prealloc or -rt mlock=on to avoid this problem. Paolo
Re: [Qemu-devel] [PATCH 3/3] qtest: kill QEMU process on g_assert() failure
Paolo Bonzini pbonz...@redhat.com writes: Il 18/02/2014 11:05, Markus Armbruster ha scritto: Yes, SIGABRT is synchronous for all purposes. So the only danger is that g_string_free() or g_free() could fail while we're in g_assert(false). But they don't, which makes sense because they are totally unrelated to g_assert() and therefore can handle re-entrancy. The (theoretical!) problem is when it aborts in the bowels of g_*free(), and your SIGABRT handler reenters g_*free(). In practice there is no issue and I've tested assertion failure with glib 1.2.10. Worst that can happen is we crash on the way from abort() to process termination. Tolerable. What about recursive locking of a non-recursive mutex? You're right, deadlock is possible. Strengthens the argument for: Still, avoiding unnecessary cleanup on that path seems prudent to me. If you agree, factor out the kill()/waitpid(), and call only that from the signal handler.
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
On Tue, Feb 18, 2014 at 11:42:19AM +0100, Paolo Bonzini wrote: Il 18/02/2014 11:38, Michael S. Tsirkin ha scritto: What if you detach and re-attach? Is it fast then? If yes this means the issue is COW breaking that occurs with get_user_pages, not translation as such. Try hugepages with prealloc - does it help? I agree it's either COW breaking or (similarly) locking pages that the guest hasn't touched yet. You can use prealloc or -rt mlock=on to avoid this problem. Paolo Or the new shared flag - IIRC shared VMAs don't do COW either.
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
Hi, all The VM will get stuck for a while(about 6s for a VM with 20GB memory) when attaching a pass-through PCI card to the non-pass-through VM for the first time. The reason is that the host will build the whole VT-d GPA-HPA DMAR page-table, which needs a lot of time, and during this time, the qemu_global_mutex lock is hold by the main-thread, if the vcpu thread IOCTL return, it will be blocked to waiting main-thread to release the qemu_global_mutex lock, so the VM got stuck. The race between qemu-main-thread and vcpu-thread is shown as below, QEMU-main-threadvcpu-thread | | qemu_mutex_lock_iothread qemu_mutex_lock(qemu_global_mutex) | | +loop- -+ +loop+ || | | | qemu_mutex_unlock_iothread| qemu_mutex_unlock_iothread || | | | poll | kvm_vcpu_ioctl(KVM_RUN) || | | | qemu_mutex_lock_iothread | | || | | -- || | qemu_mutex_lock_iothread | kvm_device_pci_assign| | || | blocked to waiting main-thread to release the qemu lock | about 6 sec for 20GB memory | | || | | ++ +-+ Any advises? Thanks, Zhang Haoyu What if you detach and re-attach? Is it fast then? Yes, because the VT-d GPA-HPA DMAR page-table has been built, no need to re-build it. If yes this means the issue is COW breaking that occurs with get_user_pages, not translation as such. Try hugepages with prealloc - does it help? Yes, a bit help gained, but it cannot resolve the problem completely, the stuck still happened. Thanks, Zhang Haoyu
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
Il 18/02/2014 11:51, Michael S. Tsirkin ha scritto: I agree it's either COW breaking or (similarly) locking pages that the guest hasn't touched yet. You can use prealloc or -rt mlock=on to avoid this problem. Paolo Or the new shared flag - IIRC shared VMAs don't do COW either. Only if the problem isn't locking and zeroing of untouched pages (also, it is not upstream is it?). Can you make a profile with perf? Paolo
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 18 February 2014 10:13, Andreas Färber afaer...@suse.de wrote: Am 18.02.2014 10:46, schrieb Peter Maydell: On 18 February 2014 01:02, Andreas Färber afaer...@suse.de wrote: Shouldn't 1/3 and 3/3 be included in 1.7.1 as guest-triggerable bugs? There's no qemu-sta...@nongnu.org in the commit message. That's up to whoever maintains stable, I guess. I have no objection. No, we've had that topic before: It's your job as submitter and maintainer to flag that appropriately in the commit message, as per QEMU Summit 2012. I don't think this workflow works. I have no idea what stable's criteria are, and if you rely on people adding a cc you're going to miss stuff. thanks -- PMM
Re: [Qemu-devel] [RFC 0/9] generate dynamic _CRS for motherboard resources
On Mon, 17 Feb 2014 13:02:18 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Mon, Feb 17, 2014 at 11:33:23AM +0100, Igor Mammedov wrote: On Sun, 16 Feb 2014 17:53:45 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Fri, Feb 07, 2014 at 01:51:27PM +0100, Igor Mammedov wrote: Since introduction of PCIHP, it became problematic to punch hole in PCI0._CRS statically since PCI hotplug region size became runtime changeable. What makes it runtime changeable? piix machine acpi-pci-hotplug-with-bridge-support=on effectively changes size of pcihp MMIO region It adds 4 bytes. Let's just reserve a reasonably sized region, like 512 bytes. Reserve where and for how log it will be enough? Eventually we will continue punching holes or increasing size of this one big hole. Q35 adds more to the problem with guest programmable pm_io_base, we can't statically punch hole for it. So replace static hole punching with dynamically consumed resources in a child device on PCI0 bus. i.e generate PNP0C02 device as a child of PCI0 bus at runtime and consume GPE0, PCI/CPU hotplug IO resources in it instead of punching holes in static PCI0._CRS. It seems that we are being too exact with IO resources here. Can't we roughly reserve 0xae00 to 0xafff and be done with it? that would be easiest way for this specific case if we could agree for ranges on PIIX/Q35 machines. But I also use it as excuse to introduce ASL like macros so that building dynamic SSDT would be easier i.e. replace template patching with single place where dynamic device is defined and its values are set in simple and straightforward manner. We don't need excuses really :) But the approach itself needs some work IMHO, in particular ASL like syntax by using macros and varargs is not something to strive for IMHO :) Why don't we just pass GArrays around? crs = build_alloc_crs(); build_append_resource(crs, ) I don't like much macros myself, but compared to alternative approach ^^^ (I've tried it) it's still harder to read/write ASL constructs right. I think that creating AML using ASL like flow and primitives is better than building it using custom APIs, since it looks like it was documented in spec. Macros+vararg provide a necessary degree of flexibility to create AML using ASL like flow and constructs. Or if you really want this, using array of GArray: build_append_crs(ssdt, { build_alloc_resource(...), build_alloc_resource(...), build_alloc_resource(...), build_alloc_resource(...), NULL }) that would work if there was a way to specify arbitrary statements inside embedded array (like in the last hunk of [9/9]). Tested with Windows XPsp3, Vista, Windows Server 2003, 2008, 2012r2. PS: Series adds several ASL like macros to simplify code for dynamic generation of AML structures. Igor Mammedov (9): Revert pc: Q35 DSDT: exclude CPU hotplug IO range from PCI bus resources Revert pc: PIIX DSDT: exclude CPU/PCI hotplug GPE0 IO range from PCI bus resources Partial revert pc: ACPI: expose PRST IO range via _CRS acpi: replace opencoded opcodes with defines acpi: add PNP0C02 to PCI0 bus acpi: consume GPE0 IO resources in PNP0C02 device acpi: consume CPU hotplug IO resource in PNP0C02 device pcihp: expose PCI hotplug MMIO base/length as properties of piix4pm acpi: consume PCIHP IO resource in PNP0C02 device hw/acpi/pcihp.c | 28 ++ hw/acpi/piix4.c |1 + hw/i386/acpi-build.c | 177 ++-- hw/i386/acpi-dsdt-cpu-hotplug.dsl | 11 --- hw/i386/acpi-dsdt-pci-crs.dsl | 15 +++- hw/i386/acpi-dsdt.dsl | 39 hw/i386/q35-acpi-dsdt.dsl | 16 include/hw/acpi/pcihp.h |4 + 8 files changed, 214 insertions(+), 77 deletions(-) -- Regards, Igor
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
On Tue, Feb 18, 2014 at 12:05:19PM +0100, Paolo Bonzini wrote: Il 18/02/2014 11:51, Michael S. Tsirkin ha scritto: I agree it's either COW breaking or (similarly) locking pages that the guest hasn't touched yet. You can use prealloc or -rt mlock=on to avoid this problem. Paolo Or the new shared flag - IIRC shared VMAs don't do COW either. Only if the problem isn't locking and zeroing of untouched pages (also, it is not upstream is it?). No but it's a small patch - part of vhost user patchset. Can you make a profile with perf? Paolo
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
Il 18/02/2014 12:09, Peter Maydell ha scritto: No, we've had that topic before: It's your job as submitter and maintainer to flag that appropriately in the commit message, as per QEMU Summit 2012. I don't think this workflow works. I have no idea what stable's criteria are, and if you rely on people adding a cc you're going to miss stuff. There isn't really a standard criterion. It's up to each maintainer to be stricter or looser on what goes to stable. Paolo
Re: [Qemu-devel] hotplug: VM got stuck when attaching a pass-through device to the non-pass-through VM for the first time
What if you detach and re-attach? Is it fast then? If yes this means the issue is COW breaking that occurs with get_user_pages, not translation as such. Try hugepages with prealloc - does it help? I agree it's either COW breaking or (similarly) locking pages that the guest hasn't touched yet. You can use prealloc or -rt mlock=on to avoid this problem. It gets better if using -rt mlock=on, but still cannot resolve the problem completely. VT-d and EPT do not share the GPA-HPA page-table, still need to build VT-d GPA-HPA DMAR page-table, Although the -rt mlock=on option guarantees that all of vm memory have been touched before attaching the pass-through device, the building is faster, but which still need some time. Thanks, Zhang Haoyu Paolo
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 18 February 2014 11:16, Paolo Bonzini pbonz...@redhat.com wrote: Il 18/02/2014 12:09, Peter Maydell ha scritto: No, we've had that topic before: It's your job as submitter and maintainer to flag that appropriately in the commit message, as per QEMU Summit 2012. I don't think this workflow works. I have no idea what stable's criteria are, and if you rely on people adding a cc you're going to miss stuff. There isn't really a standard criterion. It's up to each maintainer to be stricter or looser on what goes to stable. My criteria for ARM in the past has typically been there's a new release every three months, anything that got past the release testing process for release N is sufficiently non-critical it can just go into release N+1. thanks -- PMM
Re: [Qemu-devel] [PATCH v2 2/2] target-arm: Add support for AArch32ARMv8 CRC32 instructionss
will.new...@linaro.org writes: Add support for AArch32 CRC32 and CRC32C instructions added in ARMv8. The CRC32-C implementation used is the built-in qemu implementation and The CRC-32 implementation is from zlib. This requires adding zlib to LIBS to ensure it is linked for the linux-user binary. Signed-off-by: Will Newton will.new...@linaro.org --- configure | 2 +- target-arm/helper.c| 39 +++ target-arm/helper.h| 3 +++ target-arm/translate.c | 48 4 files changed, 91 insertions(+), 1 deletion(-) diff --git a/configure b/configure index 4648117..822842c 100755 --- a/configure +++ b/configure @@ -1550,7 +1550,7 @@ EOF Make sure to have the zlib libs and headers installed. fi fi -libs_softmmu=$libs_softmmu -lz +LIBS=$LIBS -lz ## # libseccomp check diff --git a/target-arm/helper.c b/target-arm/helper.c index 5ae08c9..294cfaf 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -5,6 +5,8 @@ #include sysemu/arch_init.h #include sysemu/sysemu.h #include qemu/bitops.h +#include qemu/crc32c.h +#include zlib.h /* For crc32 */ #ifndef CONFIG_USER_ONLY static inline int get_phys_addr(CPUARMState *env, uint32_t address, @@ -4468,3 +4470,40 @@ int arm_rmode_to_sf(int rmode) } return rmode; } + +static void crc_init_buffer(uint8_t *buf, uint32_t val, uint32_t bytes) +{ +memset(buf, 0, 4); + +if (bytes == 1) { +buf[0] = val 0xff; +} else if (bytes == 2) { +buf[0] = val 0xff; +buf[1] = (val 8) 0xff; +} else { +buf[0] = val 0xff; +buf[1] = (val 8) 0xff; +buf[2] = (val 16) 0xff; +buf[3] = (val 24) 0xff; +} +} + +uint32_t HELPER(crc32)(uint32_t acc, uint32_t val, uint32_t bytes) +{ +uint8_t buf[4]; + +crc_init_buffer(buf, val, bytes); + +/* zlib crc32 converts the accumulator and output to one's complement. */ +return crc32(acc ^ 0x, buf, bytes) ^ 0x; +} + +uint32_t HELPER(crc32c)(uint32_t acc, uint32_t val, uint32_t bytes) +{ +uint8_t buf[4]; + +crc_init_buffer(buf, val, bytes); + +/* Linux crc32c converts the output to one's complement. */ +return crc32c(acc, buf, bytes) ^ 0x; +} diff --git a/target-arm/helper.h b/target-arm/helper.h index 951e6ad..fb92e53 100644 --- a/target-arm/helper.h +++ b/target-arm/helper.h @@ -494,6 +494,9 @@ DEF_HELPER_3(neon_qzip32, void, env, i32, i32) DEF_HELPER_4(crypto_aese, void, env, i32, i32, i32) DEF_HELPER_4(crypto_aesmc, void, env, i32, i32, i32) +DEF_HELPER_3(crc32, i32, i32, i32, i32) +DEF_HELPER_3(crc32c, i32, i32, i32, i32) + #ifdef TARGET_AARCH64 #include helper-a64.h #endif diff --git a/target-arm/translate.c b/target-arm/translate.c index 782aab8..9410b6a 100644 --- a/target-arm/translate.c +++ b/target-arm/translate.c @@ -7541,6 +7541,32 @@ static void disas_arm_insn(CPUARMState * env, DisasContext *s) store_reg(s, 14, tmp2); gen_bx(s, tmp); break; +case 0x4: +{ +/* crc32/crc32c */ +uint32_t c = extract32(insn, 9, 1); + +/* size == 64 is UNPREDICTABLE but handle as UNDEFINED. */ +if (op1 == 0x3) { +goto illegal_op; +} + +rn = (insn 16) 0xf; +rd = (insn 12) 0xf; use extract32? + +tmp = load_reg(s, rn); +tmp2 = load_reg(s, rm); +tmp3 = tcg_const_i32(1 op1); +if (c) { +gen_helper_crc32c(tmp, tmp, tmp2, tmp3); +} else { +gen_helper_crc32(tmp, tmp, tmp2, tmp3); +} +tcg_temp_free_i32(tmp2); +tcg_temp_free_i32(tmp3); +store_reg(s, rd, tmp); +break; +} case 0x5: /* saturating add/subtract */ ARCH(5TE); rd = (insn 12) 0xf; @@ -9125,6 +9151,28 @@ static int disas_thumb2_insn(CPUARMState *env, DisasContext *s, uint16_t insn_hw case 0x18: /* clz */ gen_helper_clz(tmp, tmp); break; +case 0x20: +case 0x21: +case 0x22: +case 0x28: +case 0x29: +case 0x2a: +{ +/* crc32/crc32c */ +uint32_t sz = op 0x3; +uint32_t c = op 0x8; + +tmp2 = load_reg(s, rm); +tmp3 = tcg_const_i32(1 sz); +if (c) { +gen_helper_crc32c(tmp, tmp, tmp2, tmp3); +} else { +gen_helper_crc32(tmp, tmp, tmp2, tmp3); +
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
Il 18/02/2014 12:22, Peter Maydell ha scritto: There isn't really a standard criterion. It's up to each maintainer to be stricter or looser on what goes to stable. My criteria for ARM in the past has typically been there's a new release every three months, anything that got past the release testing process for release N is sufficiently non-critical it can just go into release N+1. Then you can ignore qemu-stable. :) Paolo
Re: [Qemu-devel] [RFC 0/9] generate dynamic _CRS for motherboard resources
On Tue, Feb 18, 2014 at 12:10:27PM +0100, Igor Mammedov wrote: On Mon, 17 Feb 2014 13:02:18 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Mon, Feb 17, 2014 at 11:33:23AM +0100, Igor Mammedov wrote: On Sun, 16 Feb 2014 17:53:45 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Fri, Feb 07, 2014 at 01:51:27PM +0100, Igor Mammedov wrote: Since introduction of PCIHP, it became problematic to punch hole in PCI0._CRS statically since PCI hotplug region size became runtime changeable. What makes it runtime changeable? piix machine acpi-pci-hotplug-with-bridge-support=on effectively changes size of pcihp MMIO region It adds 4 bytes. Let's just reserve a reasonably sized region, like 512 bytes. Reserve where and for how log it will be enough? Eventually we will continue punching holes or increasing size of this one big hole. I agree this might become necessary. But why not keep it simple while we can? The advantage of ACPI in qemu is we can fix it without touching guest code. Let's use it. So if there's just pm_io_base, we can just write it in ASL and patch in that value. Q35 adds more to the problem with guest programmable pm_io_base, we can't statically punch hole for it. Aha. That's a good point. But I still worry how we'll handle e.g. pvpanic. Paolo had an idea to scan the memory range, and punch holes for each section, making an exception for regions owned by PCI devices. I thought it would work originally too, but it seems that e.g. VGA range is also owned by pci devices sometimes. So replace static hole punching with dynamically consumed resources in a child device on PCI0 bus. i.e generate PNP0C02 device as a child of PCI0 bus at runtime and consume GPE0, PCI/CPU hotplug IO resources in it instead of punching holes in static PCI0._CRS. It seems that we are being too exact with IO resources here. Can't we roughly reserve 0xae00 to 0xafff and be done with it? that would be easiest way for this specific case if we could agree for ranges on PIIX/Q35 machines. But I also use it as excuse to introduce ASL like macros so that building dynamic SSDT would be easier i.e. replace template patching with single place where dynamic device is defined and its values are set in simple and straightforward manner. We don't need excuses really :) But the approach itself needs some work IMHO, in particular ASL like syntax by using macros and varargs is not something to strive for IMHO :) Why don't we just pass GArrays around? crs = build_alloc_crs(); build_append_resource(crs, ) I don't like much macros myself, but compared to alternative approach ^^^ (I've tried it) it's still harder to read/write ASL constructs right. I think that creating AML using ASL like flow and primitives is better than building it using custom APIs, since it looks like it was documented in spec. Macros+vararg provide a necessary degree of flexibility to create AML using ASL like flow and constructs. Only you lose all type safety. When compiling ASL with IASL at least it does some type checking. Or if you really want this, using array of GArray: build_append_crs(ssdt, { build_alloc_resource(...), build_alloc_resource(...), build_alloc_resource(...), build_alloc_resource(...), NULL }) that would work if there was a way to specify arbitrary statements inside embedded array (like in the last hunk of [9/9]). In C we have functions for this :) Last chunk: -if (pm-pcihp_io_base) { -ACPI_IO(RESBUF, Decode16, -pm-pcihp_io_base, /* _MIN */ -pm-pcihp_io_base, /* _MAX */ -0x0, /* _ALN */ -pm-pcihp_io_len); /* _LEN */ -} + build_io_resource_16(pm-pcihp_io_base, pm-pcihp_io_base, 0x0, pm-pcihp_io_len); Which actually makes one think. Isn't _MAX wrong here? I'd expect pm-pcihp_io_base + pm-pcihp_io_len - 1 I would also key this off pcihp_io_len. Tested with Windows XPsp3, Vista, Windows Server 2003, 2008, 2012r2. PS: Series adds several ASL like macros to simplify code for dynamic generation of AML structures. Igor Mammedov (9): Revert pc: Q35 DSDT: exclude CPU hotplug IO range from PCI bus resources Revert pc: PIIX DSDT: exclude CPU/PCI hotplug GPE0 IO range from PCI bus resources Partial revert pc: ACPI: expose PRST IO range via _CRS acpi: replace opencoded opcodes with defines acpi: add PNP0C02 to PCI0 bus acpi: consume GPE0 IO resources in PNP0C02 device acpi: consume
Re: [Qemu-devel] [PATCH v2] qcow2: Set zero flag for discarded clusters
Am 17.02.2014 um 18:12 hat Eric Blake geschrieben: On 02/17/2014 07:45 AM, Kevin Wolf wrote: Instead of making the backing file contents visible again after a discard request, set the zero flag if possible (i.e. on version = 3). Signed-off-by: Kevin Wolf kw...@redhat.com --- block/qcow2-cluster.c | 22 -- tests/qemu-iotests/046 | 18 ++ 2 files changed, 34 insertions(+), 6 deletions(-) +if (!!(old_offset QCOW_OFLAG_ZERO)) { The !! is not necessary here; any non-zero value in a boolean context gives the same result as an explicit conversion to 0-or-1. Thanks, that looks a bit odd indeed. I'll change it. Kevin pgpQg0digIRKr.pgp Description: PGP signature
[Qemu-devel] [PATCH] block/iscsi: fix deadlock on scsi check condition
the retry logic was broken because the complete status of the task structure was not reset. this resulted in an infinite loop retrying the command over and over. CC: qemu-sta...@nongnu.org Signed-off-by: Peter Lieven p...@kamp.de --- block/iscsi.c |9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/block/iscsi.c b/block/iscsi.c index 3c0b728..ded414e 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -145,12 +145,13 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status, if (iTask-retries-- 0 status == SCSI_STATUS_CHECK_CONDITION task-sense.key == SCSI_SENSE_UNIT_ATTENTION) { +error_report(iSCSI CheckCondition: %s, iscsi_get_error(iscsi)); iTask-do_retry = 1; goto out; } if (status != SCSI_STATUS_GOOD) { -error_report(iSCSI: Failure. %s, iscsi_get_error(iscsi)); +error_report(iSCSI Failure: %s, iscsi_get_error(iscsi)); } out: @@ -325,6 +326,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -399,6 +401,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -433,6 +436,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -683,6 +687,7 @@ retry: scsi_free_scsi_task(iTask.task); iTask.task = NULL; } +iTask.complete = 0; goto retry; } @@ -767,6 +772,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -836,6 +842,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } -- 1.7.9.5
[Qemu-devel] [PATCH V18 02/12] quorum: Create BDRVQuorumState and BlkDriver and do init.
From: Benoît Canet ben...@irqsave.net Create the structure holding the quorum settings and write the minimal block driver instanciation boilerplate. Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/quorum.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index 950f5cc..375ffd5 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -15,6 +15,23 @@ #include block/block_int.h +/* the following structure holds the state of one quorum instance */ +typedef struct BDRVQuorumState { +BlockDriverState **bs; /* children BlockDriverStates */ +int num_children; /* children count */ +int threshold; /* if less than threshold children reads gave the +* same result a quorum error occurs. +*/ +bool is_blkverify; /* true if the driver is in blkverify mode +* Writes are mirrored on two children devices. +* On reads the two children devices' contents are +* compared and if a difference is spotted its +* location is printed and the code aborts. +* It is useful to debug other block drivers by +* comparing them with a reference one. +*/ +} BDRVQuorumState; + typedef struct QuorumAIOCB QuorumAIOCB; /* Quorum will create one instance of the following structure per operation it @@ -51,3 +68,17 @@ struct QuorumAIOCB { bool is_read; int vote_ret; }; + +static BlockDriver bdrv_quorum = { +.format_name= quorum, +.protocol_name = quorum, + +.instance_size = sizeof(BDRVQuorumState), +}; + +static void bdrv_quorum_init(void) +{ +bdrv_register(bdrv_quorum); +} + +block_init(bdrv_quorum_init); -- 1.8.3.2
[Qemu-devel] [PATCH V18 08/12] quorum: Add quorum_invalidate_cache().
From: Benoît Canet ben...@irqsave.net We really want that live migration works with quorum so implement quorum_invalidate_cache(). Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/quorum.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index 131ffbf..d924055 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -619,6 +619,16 @@ static int64_t quorum_getlength(BlockDriverState *bs) return result; } +static void quorum_invalidate_cache(BlockDriverState *bs) +{ +BDRVQuorumState *s = bs-opaque; +int i; + +for (i = 0; i s-num_children; i++) { +bdrv_invalidate_cache(s-bs[i]); +} +} + static BlockDriver bdrv_quorum = { .format_name= quorum, .protocol_name = quorum, @@ -629,6 +639,7 @@ static BlockDriver bdrv_quorum = { .bdrv_aio_readv = quorum_aio_readv, .bdrv_aio_writev= quorum_aio_writev, +.bdrv_invalidate_cache = quorum_invalidate_cache, }; static void bdrv_quorum_init(void) -- 1.8.3.2
[Qemu-devel] [PATCH V18 03/12] quorum: Add quorum_aio_writev and its dependencies.
From: Benoît Canet ben...@irqsave.net Writes are mirrored num_children times on num_children devices. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/quorum.c | 103 + 1 file changed, 103 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index 375ffd5..99e0e03 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -69,11 +69,114 @@ struct QuorumAIOCB { int vote_ret; }; +static void quorum_aio_cancel(BlockDriverAIOCB *blockacb) +{ +QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common); +BDRVQuorumState *s = acb-common.bs-opaque; +int i; + +/* cancel all callbacks */ +for (i = 0; i s-num_children; i++) { +bdrv_aio_cancel(acb-qcrs[i].aiocb); +} + +g_free(acb-qcrs); +qemu_aio_release(acb); +} + +static AIOCBInfo quorum_aiocb_info = { +.aiocb_size = sizeof(QuorumAIOCB), +.cancel = quorum_aio_cancel, +}; + +static void quorum_aio_finalize(QuorumAIOCB *acb) +{ +int ret = 0; + +acb-common.cb(acb-common.opaque, ret); + +g_free(acb-qcrs); +qemu_aio_release(acb); +} + +static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, + BlockDriverState *bs, + QEMUIOVector *qiov, + uint64_t sector_num, + int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque) +{ +QuorumAIOCB *acb = qemu_aio_get(quorum_aiocb_info, bs, cb, opaque); +int i; + +acb-common.bs-opaque = s; +acb-sector_num = sector_num; +acb-nb_sectors = nb_sectors; +acb-qiov = qiov; +acb-qcrs = g_new0(QuorumChildRequest, s-num_children); +acb-count = 0; +acb-success_count = 0; +acb-is_read = false; +acb-vote_ret = 0; + +for (i = 0; i s-num_children; i++) { +acb-qcrs[i].buf = NULL; +acb-qcrs[i].ret = 0; +acb-qcrs[i].parent = acb; +} + +return acb; +} + +static void quorum_aio_cb(void *opaque, int ret) +{ +QuorumChildRequest *sacb = opaque; +QuorumAIOCB *acb = sacb-parent; +BDRVQuorumState *s = acb-common.bs-opaque; + +sacb-ret = ret; +acb-count++; +if (ret == 0) { +acb-success_count++; +} +assert(acb-count = s-num_children); +assert(acb-success_count = s-num_children); +if (acb-count s-num_children) { +return; +} + +quorum_aio_finalize(acb); +} + +static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs, + int64_t sector_num, + QEMUIOVector *qiov, + int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque) +{ +BDRVQuorumState *s = bs-opaque; +QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, nb_sectors, + cb, opaque); +int i; + +for (i = 0; i s-num_children; i++) { +acb-qcrs[i].aiocb = bdrv_aio_writev(s-bs[i], sector_num, qiov, + nb_sectors, quorum_aio_cb, + acb-qcrs[i]); +} + +return acb-common; +} + static BlockDriver bdrv_quorum = { .format_name= quorum, .protocol_name = quorum, .instance_size = sizeof(BDRVQuorumState), + +.bdrv_aio_writev= quorum_aio_writev, }; static void bdrv_quorum_init(void) -- 1.8.3.2
[Qemu-devel] [PATCH V18 01/12] quorum: Create quorum.c, add QuorumChildRequest and QuorumAIOCB.
From: Benoît Canet ben...@irqsave.net Quorum is a block filter mirroring writes to num_children children. For reads quorum reads each children and does a vote. If more than vote_threshold versions are identical the quorum is reached and this winning version is returned to the guest. So quorum prevents bit corruption. For high availability purpose minority errors are reported via QMP but the guest does not see them. This patch creates the driver C source file and introduces the structures that will be used in asynchronous reads and writes. Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/Makefile.objs | 1 + block/quorum.c | 53 + 2 files changed, 54 insertions(+) create mode 100644 block/quorum.c diff --git a/block/Makefile.objs b/block/Makefile.objs index e254a21..716556f 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -3,6 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-$(CONFIG_VHDX) += vhdx.o vhdx-endian.o vhdx-log.o +block-obj-y += quorum.o block-obj-y += parallels.o blkdebug.o blkverify.o block-obj-y += snapshot.o qapi.o block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o diff --git a/block/quorum.c b/block/quorum.c new file mode 100644 index 000..950f5cc --- /dev/null +++ b/block/quorum.c @@ -0,0 +1,53 @@ +/* + * Quorum Block filter + * + * Copyright (C) 2012-2014 Nodalink, EURL. + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * Based on the design and code of blkverify.c (Copyright (C) 2010 IBM, Corp) + * and blkmirror.c (Copyright (C) 2011 Red Hat, Inc). + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include block/block_int.h + +typedef struct QuorumAIOCB QuorumAIOCB; + +/* Quorum will create one instance of the following structure per operation it + * performs on its children. + * So for each read/write operation coming from the upper layer there will be + * $children_count QuorumChildRequest. + */ +typedef struct QuorumChildRequest { +BlockDriverAIOCB *aiocb; +QEMUIOVector qiov; +uint8_t *buf; +int ret; +QuorumAIOCB *parent; +} QuorumChildRequest; + +/* Quorum will use the following structure to track progress of each read/write + * operation received by the upper layer. + * This structure hold pointers to the QuorumChildRequest structures instances + * used to do operations on each children and track overall progress. + */ +struct QuorumAIOCB { +BlockDriverAIOCB common; + +/* Request metadata */ +uint64_t sector_num; +int nb_sectors; + +QEMUIOVector *qiov; /* calling IOV */ + +QuorumChildRequest *qcrs; /* individual child requests */ +int count; /* number of completed AIOCB */ +int success_count; /* number of successfully completed AIOCB */ + +bool is_read; +int vote_ret; +}; -- 1.8.3.2
[Qemu-devel] [PATCH V18 09/12] quorum: Add quorum_co_flush().
From: Benoît Canet ben...@irqsave.net Makes a vote to select error if any. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/quorum.c | 28 1 file changed, 28 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index d924055..1007dfc 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -629,12 +629,40 @@ static void quorum_invalidate_cache(BlockDriverState *bs) } } +static coroutine_fn int quorum_co_flush(BlockDriverState *bs) +{ +BDRVQuorumState *s = bs-opaque; +QuorumVoteVersion *winner = NULL; +QuorumVotes error_votes; +QuorumVoteValue result_value; +int i; +int result = 0; + +QLIST_INIT(error_votes.vote_list); +error_votes.compare = quorum_64bits_compare; + +for (i = 0; i s-num_children; i++) { +result = bdrv_co_flush(s-bs[i]); +result_value.l = result; +quorum_count_vote(error_votes, result_value, i); +} + +winner = quorum_get_vote_winner(error_votes); +result = winner-value.l; + +quorum_free_vote_list(error_votes); + +return result; +} + static BlockDriver bdrv_quorum = { .format_name= quorum, .protocol_name = quorum, .instance_size = sizeof(BDRVQuorumState), +.bdrv_co_flush_to_disk = quorum_co_flush, + .bdrv_getlength = quorum_getlength, .bdrv_aio_readv = quorum_aio_readv, -- 1.8.3.2
[Qemu-devel] [PATCH V18 10/12] quorum: Implement recursive .bdrv_recurse_is_first_non_filter in quorum.
From: Benoît Canet ben...@irqsave.net This is used to activate quorum snapshot. Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/quorum.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index 1007dfc..40832c0 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -655,6 +655,23 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) return result; } +static bool quorum_recurse_is_first_non_filter(BlockDriverState *bs, + BlockDriverState *candidate) +{ +BDRVQuorumState *s = bs-opaque; +int i; + +for (i = 0; i s-num_children; i++) { +bool perm = bdrv_recurse_is_first_non_filter(s-bs[i], + candidate); +if (perm) { +return true; +} +} + +return false; +} + static BlockDriver bdrv_quorum = { .format_name= quorum, .protocol_name = quorum, @@ -668,6 +685,8 @@ static BlockDriver bdrv_quorum = { .bdrv_aio_readv = quorum_aio_readv, .bdrv_aio_writev= quorum_aio_writev, .bdrv_invalidate_cache = quorum_invalidate_cache, + +.bdrv_recurse_is_first_non_filter = quorum_recurse_is_first_non_filter, }; static void bdrv_quorum_init(void) -- 1.8.3.2
[Qemu-devel] [PATCH V18 11/12] quorum: Add quorum_open() and quorum_close().
From: Benoît Canet ben...@irqsave.net Example of command line: -drive if=virtio,driver=quorum,\ children.0.file.filename=1.raw,\ children.0.node-name=1.raw,\ children.0.driver=raw,\ children.1.file.filename=2.raw,\ children.1.node-name=2.raw,\ children.1.driver=raw,\ children.2.file.filename=3.raw,\ children.2.node-name=3.raw,\ children.2.driver=raw,\ vote-threshold=2 blkverify=on with vote-threshold=2 and two files can be passed to emulate blkverify. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/quorum.c | 161 +++ monitor.c| 3 ++ qapi-schema.json | 21 +++- 3 files changed, 184 insertions(+), 1 deletion(-) diff --git a/block/quorum.c b/block/quorum.c index 40832c0..18721ba 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -20,6 +20,9 @@ #define HASH_LENGTH 32 +#define QUORUM_OPT_VOTE_THRESHOLD vote-threshold +#define QUORUM_OPT_BLKVERIFY blkverify + /* This union holds a vote hash value */ typedef union QuorumVoteValue { char h[HASH_LENGTH]; /* SHA-256 hash */ @@ -672,12 +675,170 @@ static bool quorum_recurse_is_first_non_filter(BlockDriverState *bs, return false; } +static int quorum_valid_threshold(int threshold, int num_children, Error **errp) +{ + +if (threshold 1) { +error_set(errp, QERR_INVALID_PARAMETER_VALUE, + vote-threshold, value = 1); +return -ERANGE; +} + +if (threshold num_children) { +error_setg(errp, threshold may not exceed children count); +return -ERANGE; +} + +return 0; +} + +static QemuOptsList quorum_runtime_opts = { +.name = quorum, +.head = QTAILQ_HEAD_INITIALIZER(quorum_runtime_opts.head), +.desc = { +{ +.name = QUORUM_OPT_VOTE_THRESHOLD, +.type = QEMU_OPT_NUMBER, +.help = The number of vote needed for reaching quorum, +}, +{ +.name = QUORUM_OPT_BLKVERIFY, +.type = QEMU_OPT_BOOL, +.help = Trigger block verify mode if set, +}, +{ /* end of list */ } +}, +}; + +static int quorum_open(BlockDriverState *bs, QDict *options, int flags, + Error **errp) +{ +BDRVQuorumState *s = bs-opaque; +Error *local_err = NULL; +QemuOpts *opts; +bool *opened; +QDict *sub = NULL; +QList *list = NULL; +const QListEntry *lentry; +const QDictEntry *dentry; +int i; +int ret = 0; + +qdict_flatten(options); +qdict_extract_subqdict(options, sub, children.); +qdict_array_split(sub, list); + +/* count how many different children are present and validate + * qdict_size(sub) address the open by reference case + */ +s-num_children = !qlist_size(list) ? qdict_size(sub) : qlist_size(list); +if (s-num_children 2) { +error_setg(local_err, + Number of provided children must be greater than 1); +ret = -EINVAL; +goto exit; +} + +opts = qemu_opts_create(quorum_runtime_opts, NULL, 0, error_abort); +qemu_opts_absorb_qdict(opts, options, local_err); +if (error_is_set(local_err)) { +ret = -EINVAL; +goto exit; +} + +s-threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0); + +/* and validate it against s-num_children */ +ret = quorum_valid_threshold(s-threshold, s-num_children, local_err); +if (ret 0) { +goto exit; +} + +/* is the driver in blkverify mode */ +if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) +s-num_children == 2 s-threshold == 2) { +s-is_blkverify = true; +} else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) { +fprintf(stderr, blkverify mode is set by setting blkverify=on +and using two files with vote_threshold=2\n); +} + +/* allocate the children BlockDriverState array */ +s-bs = g_new0(BlockDriverState *, s-num_children); +opened = g_new0(bool, s-num_children); + +/* Open by file name or options dict (command line or QMP) */ +if (s-num_children == qlist_size(list)) { +for (i = 0, lentry = qlist_first(list); lentry; +lentry = qlist_next(lentry), i++) { +QDict *d = qobject_to_qdict(lentry-value); +QINCREF(d); +ret = bdrv_open(s-bs[i], NULL, NULL, d, flags, NULL, local_err); +if (ret 0) { +goto close_exit; +} +opened[i] = true; +} +/* Open by QMP references */ +} else { +for (i = 0, dentry = qdict_first(sub); dentry; + dentry = qdict_next(sub, dentry), i++) { +QString *string = qobject_to_qstring(dentry-value); +ret = bdrv_open(s-bs[i], NULL, qstring_get_str(string), NULL, +flags, NULL, local_err); +if (ret 0) { +goto close_exit; +} +
[Qemu-devel] [PATCH V18 04/12] blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify.
From: Benoît Canet ben...@irqsave.net qemu_iovec_compare() will be used to compare IOs vectors in quorum blkverify mode. The patch extracts these functions in order to factorize the code. Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/blkverify.c | 108 +- include/qemu-common.h | 2 + util/iov.c| 106 + 3 files changed, 110 insertions(+), 106 deletions(-) diff --git a/block/blkverify.c b/block/blkverify.c index b57da59..c86ad7a 100644 --- a/block/blkverify.c +++ b/block/blkverify.c @@ -173,110 +173,6 @@ static int64_t blkverify_getlength(BlockDriverState *bs) return bdrv_getlength(s-test_file); } -/** - * Check that I/O vector contents are identical - * - * @a: I/O vector - * @b: I/O vector - * @ret:Offset to first mismatching byte or -1 if match - */ -static ssize_t blkverify_iovec_compare(QEMUIOVector *a, QEMUIOVector *b) -{ -int i; -ssize_t offset = 0; - -assert(a-niov == b-niov); -for (i = 0; i a-niov; i++) { -size_t len = 0; -uint8_t *p = (uint8_t *)a-iov[i].iov_base; -uint8_t *q = (uint8_t *)b-iov[i].iov_base; - -assert(a-iov[i].iov_len == b-iov[i].iov_len); -while (len a-iov[i].iov_len *p++ == *q++) { -len++; -} - -offset += len; - -if (len != a-iov[i].iov_len) { -return offset; -} -} -return -1; -} - -typedef struct { -int src_index; -struct iovec *src_iov; -void *dest_base; -} IOVectorSortElem; - -static int sortelem_cmp_src_base(const void *a, const void *b) -{ -const IOVectorSortElem *elem_a = a; -const IOVectorSortElem *elem_b = b; - -/* Don't overflow */ -if (elem_a-src_iov-iov_base elem_b-src_iov-iov_base) { -return -1; -} else if (elem_a-src_iov-iov_base elem_b-src_iov-iov_base) { -return 1; -} else { -return 0; -} -} - -static int sortelem_cmp_src_index(const void *a, const void *b) -{ -const IOVectorSortElem *elem_a = a; -const IOVectorSortElem *elem_b = b; - -return elem_a-src_index - elem_b-src_index; -} - -/** - * Copy contents of I/O vector - * - * The relative relationships of overlapping iovecs are preserved. This is - * necessary to ensure identical semantics in the cloned I/O vector. - */ -static void blkverify_iovec_clone(QEMUIOVector *dest, const QEMUIOVector *src, - void *buf) -{ -IOVectorSortElem sortelems[src-niov]; -void *last_end; -int i; - -/* Sort by source iovecs by base address */ -for (i = 0; i src-niov; i++) { -sortelems[i].src_index = i; -sortelems[i].src_iov = src-iov[i]; -} -qsort(sortelems, src-niov, sizeof(sortelems[0]), sortelem_cmp_src_base); - -/* Allocate buffer space taking into account overlapping iovecs */ -last_end = NULL; -for (i = 0; i src-niov; i++) { -struct iovec *cur = sortelems[i].src_iov; -ptrdiff_t rewind = 0; - -/* Detect overlap */ -if (last_end last_end cur-iov_base) { -rewind = last_end - cur-iov_base; -} - -sortelems[i].dest_base = buf - rewind; -buf += cur-iov_len - MIN(rewind, cur-iov_len); -last_end = MAX(cur-iov_base + cur-iov_len, last_end); -} - -/* Sort by source iovec index and build destination iovec */ -qsort(sortelems, src-niov, sizeof(sortelems[0]), sortelem_cmp_src_index); -for (i = 0; i src-niov; i++) { -qemu_iovec_add(dest, sortelems[i].dest_base, src-iov[i].iov_len); -} -} - static BlkverifyAIOCB *blkverify_aio_get(BlockDriverState *bs, bool is_write, int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, @@ -340,7 +236,7 @@ static void blkverify_aio_cb(void *opaque, int ret) static void blkverify_verify_readv(BlkverifyAIOCB *acb) { -ssize_t offset = blkverify_iovec_compare(acb-qiov, acb-raw_qiov); +ssize_t offset = qemu_iovec_compare(acb-qiov, acb-raw_qiov); if (offset != -1) { blkverify_err(acb, contents mismatch in sector % PRId64, acb-sector_num + (int64_t)(offset / BDRV_SECTOR_SIZE)); @@ -358,7 +254,7 @@ static BlockDriverAIOCB *blkverify_aio_readv(BlockDriverState *bs, acb-verify = blkverify_verify_readv; acb-buf = qemu_blockalign(bs-file, qiov-size); qemu_iovec_init(acb-raw_qiov, acb-qiov-niov); -blkverify_iovec_clone(acb-raw_qiov, qiov, acb-buf); +qemu_iovec_clone(acb-raw_qiov, qiov, acb-buf); bdrv_aio_readv(s-test_file, sector_num, qiov, nb_sectors, blkverify_aio_cb, acb); diff --git a/include/qemu-common.h b/include/qemu-common.h index 5054836..d70783e 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -346,6 +346,8 @@ size_t
[Qemu-devel] [PATCH V18 00/12] Quorum block filter
Hi, This is the latest version of the quorum block filter. Since the patch fixing the external_snapshot_prepare QDECREF(options) brain damage that I introduced earlier quorum works fine in every cases. in V18: Address all Max comments [Max] Fix sample quorum command line usage [Benoît] tested: Ran qemu-iotest including 081 ok quorum command line ok quorum command line snapshot transaction ok quorum QMP by reference ok quorum QMP by reference and snapshot transaction ok quorum QMP at once ok quorum QMP at once and snapshot transaction ok Benoît Canet (12): quorum: Create quorum.c, add QuorumChildRequest and QuorumAIOCB. quorum: Create BDRVQuorumState and BlkDriver and do init. quorum: Add quorum_aio_writev and its dependencies. blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify. quorum: Add quorum_aio_readv. quorum: Add quorum mechanism. quorum: Add quorum_getlength(). quorum: Add quorum_invalidate_cache(). quorum: Add quorum_co_flush(). quorum: Implement recursive .bdrv_recurse_is_first_non_filter in quorum. quorum: Add quorum_open() and quorum_close(). quorum: Add unit test. block/Makefile.objs| 1 + block/blkverify.c | 108 +- block/quorum.c | 858 + configure | 36 ++ docs/qmp/qmp-events.txt| 36 ++ include/monitor/monitor.h | 2 + include/qemu-common.h | 2 + monitor.c | 5 + qapi-schema.json | 21 +- tests/qemu-iotests/081 | 95 + tests/qemu-iotests/081.out | 34 ++ tests/qemu-iotests/group | 1 + util/iov.c | 106 ++ 13 files changed, 1198 insertions(+), 107 deletions(-) create mode 100644 block/quorum.c create mode 100755 tests/qemu-iotests/081 create mode 100644 tests/qemu-iotests/081.out -- 1.8.3.2
[Qemu-devel] [PATCH V18 06/12] quorum: Add quorum mechanism.
From: Benoît Canet ben...@irqsave.net This patchset enables the core of the quorum mechanism. The num_children reads are compared to get the majority version and if this version exists more than threshold times the guest won't see the error at all. If a block is corrupted or if an error occurs during an IO or if the quorum cannot be established QMP events are used to report to the management. Use gnutls's SHA-256 to compare versions. --enable-quorum must be used to enable the feature. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/Makefile.objs | 2 +- block/quorum.c| 391 +- configure | 36 + docs/qmp/qmp-events.txt | 36 + include/monitor/monitor.h | 2 + monitor.c | 2 + 6 files changed, 467 insertions(+), 2 deletions(-) diff --git a/block/Makefile.objs b/block/Makefile.objs index 716556f..425d7fb 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -3,7 +3,7 @@ block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-c block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-$(CONFIG_VHDX) += vhdx.o vhdx-endian.o vhdx-log.o -block-obj-y += quorum.o +block-obj-$(CONFIG_QUORUM) += quorum.o block-obj-y += parallels.o blkdebug.o blkverify.o block-obj-y += snapshot.o qapi.o block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o diff --git a/block/quorum.c b/block/quorum.c index fd19662..7acaa97 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -13,7 +13,43 @@ * See the COPYING file in the top-level directory. */ +#include gnutls/gnutls.h +#include gnutls/crypto.h #include block/block_int.h +#include qapi/qmp/qjson.h + +#define HASH_LENGTH 32 + +/* This union holds a vote hash value */ +typedef union QuorumVoteValue { +char h[HASH_LENGTH]; /* SHA-256 hash */ +int64_t l; /* simpler 64 bits hash */ +} QuorumVoteValue; + +/* A vote item */ +typedef struct QuorumVoteItem { +int index; +QLIST_ENTRY(QuorumVoteItem) next; +} QuorumVoteItem; + +/* this structure is a vote version. A version is the set of votes sharing the + * same vote value. + * The set of votes will be tracked with the items field and its cardinality is + * vote_count. + */ +typedef struct QuorumVoteVersion { +QuorumVoteValue value; +int index; +int vote_count; +QLIST_HEAD(, QuorumVoteItem) items; +QLIST_ENTRY(QuorumVoteVersion) next; +} QuorumVoteVersion; + +/* this structure holds a group of vote versions together */ +typedef struct QuorumVotes { +QLIST_HEAD(, QuorumVoteVersion) vote_list; +bool (*compare)(QuorumVoteValue *a, QuorumVoteValue *b); +} QuorumVotes; /* the following structure holds the state of one quorum instance */ typedef struct BDRVQuorumState { @@ -65,10 +101,14 @@ struct QuorumAIOCB { int count; /* number of completed AIOCB */ int success_count; /* number of successfully completed AIOCB */ +QuorumVotes votes; + bool is_read; int vote_ret; }; +static void quorum_vote(QuorumAIOCB *acb); + static void quorum_aio_cancel(BlockDriverAIOCB *blockacb) { QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common); @@ -94,6 +134,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) BDRVQuorumState *s = acb-common.bs-opaque; int i, ret = 0; +if (acb-vote_ret) { +ret = acb-vote_ret; +} + acb-common.cb(acb-common.opaque, ret); if (acb-is_read) { @@ -107,6 +151,16 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) qemu_aio_release(acb); } +static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b) +{ +return !memcmp(a-h, b-h, HASH_LENGTH); +} + +static bool quorum_64bits_compare(QuorumVoteValue *a, QuorumVoteValue *b) +{ +return a-l == b-l; +} + static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, BlockDriverState *bs, QEMUIOVector *qiov, @@ -125,6 +179,8 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, acb-qcrs = g_new0(QuorumChildRequest, s-num_children); acb-count = 0; acb-success_count = 0; +acb-votes.compare = quorum_sha256_compare; +QLIST_INIT(acb-votes.vote_list); acb-is_read = false; acb-vote_ret = 0; @@ -137,6 +193,48 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, return acb; } +static void quorum_report_bad(QuorumAIOCB *acb, char *node_name, int ret) +{ +QObject *data; +assert(node_name); +data = qobject_from_jsonf({ 'ret': %i + , 'node-name': \%s\ + , 'sector-num': % PRId64 + , 'sectors-count': %i }, + ret, node_name, acb-sector_num, acb-nb_sectors); +monitor_protocol_event(QEVENT_QUORUM_REPORT_BAD, data); +qobject_decref(data);
Re: [Qemu-devel] [PULL 00/20] acpi,pc,pci fixes and enhancements
On Thu, 13 Feb 2014, Peter Maydell wrote: On 10 February 2014 16:47, Michael S. Tsirkin m...@redhat.com wrote: The following changes since commit 2b2449f7e467957778ca006904471b231dc0ac8e: Merge remote-tracking branch 'remotes/borntraeger/tags/kvm-s390-20140131' into staging (2014-02-04 18:46:33 +) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream for you to fetch changes up to 417c45ab2f847c0a47b1232f611aa886df6a97d5: ACPI: Remove commented-out code from HPET._CRS (2014-02-10 11:09:33 +0200) Applied, thanks. It looks like that this series breaks disk unplug (hw/ide/piix.c:pci_piix3_xen_ide_unplug). I bisected it and the problem is caused by: commit 5e95494380ecf83c97d28f72134ab45e0cace8f9 Author: Igor Mammedov imamm...@redhat.com Date: Wed Feb 5 16:36:52 2014 +0100 hw/pci: switch to a generic hotplug handling for PCIDevice make qdev_unplug()/device_set_realized() to call hotplug handler's plug/unplug methods if available and remove not needed anymore hot(un)plug handling from PCIDevice. In case if hotplug handler is not available, revert to the legacy hotplug method for compatibility with not yet converted buses. Signed-off-by: Igor Mammedov imamm...@redhat.com Reviewed-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 02/18/2014 12:22 PM, Peter Maydell wrote: On 18 February 2014 11:16, Paolo Bonzini pbonz...@redhat.com wrote: Il 18/02/2014 12:09, Peter Maydell ha scritto: No, we've had that topic before: It's your job as submitter and maintainer to flag that appropriately in the commit message, as per QEMU Summit 2012. I don't think this workflow works. I have no idea what stable's criteria are, and if you rely on people adding a cc you're going to miss stuff. There isn't really a standard criterion. It's up to each maintainer to be stricter or looser on what goes to stable. My criteria for ARM in the past has typically been there's a new release every three months, anything that got past the release testing process for release N is sufficiently non-critical it can just go into release N+1. Unfortunately this doesn't work for distributions. Distros need to maintain a stable branch for the lifetime of a release to ensure that we're reasonably regression free. If you indicate that this doesn't apply to ARM it basically means you admit that ARM systems are not yet ready for stable use by customers when they want to use KVM. At least at the point when we agree that customers do want to run on a stable base for virtualization on ARM we need a working -stable system for critical fixes. Alex
[Qemu-devel] [PATCH V18 12/12] quorum: Add unit test.
Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- tests/qemu-iotests/081 | 95 ++ tests/qemu-iotests/081.out | 34 + tests/qemu-iotests/group | 1 + 3 files changed, 130 insertions(+) create mode 100755 tests/qemu-iotests/081 create mode 100644 tests/qemu-iotests/081.out diff --git a/tests/qemu-iotests/081 b/tests/qemu-iotests/081 new file mode 100755 index 000..be34544 --- /dev/null +++ b/tests/qemu-iotests/081 @@ -0,0 +1,95 @@ +#!/bin/bash +# +# Test Quorum block driver +# +# Copyright (C) 2013 Nodalink, SARL. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see http://www.gnu.org/licenses/. +# + +# creator +owner=ben...@irqsave.net + +seq=`basename $0` +echo QA output created by $seq + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! + +_cleanup() +{ +rm -rf $TEST_DIR/1.raw +rm -rf $TEST_DIR/2.raw +rm -rf $TEST_DIR/3.raw +} +trap _cleanup; exit \$status 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common.rc +. ./common.filter + +_supported_fmt raw +_supported_proto generic +_supported_os Linux + +quorum=file.driver=quorum,file.children.0.file.filename=$TEST_DIR/1.raw +quorum=$quorum,file.children.1.file.filename=$TEST_DIR/2.raw +quorum=$quorum,file.children.2.file.filename=$TEST_DIR/3.raw,file.vote-threshold=2 + +echo +echo == creating quorum files == + +size=10M + +TEST_IMG=$TEST_DIR/1.raw _make_test_img $size +TEST_IMG=$TEST_DIR/2.raw _make_test_img $size +TEST_IMG=$TEST_DIR/3.raw _make_test_img $size + +echo +echo == writing images == + +$QEMU_IO -c open -o $quorum -c write -P 0x32 0 $size | _filter_qemu_io + +echo +echo == checking quorum write == + +$QEMU_IO -c read -P 0x32 0 $size $TEST_DIR/1.raw | _filter_qemu_io +$QEMU_IO -c read -P 0x32 0 $size $TEST_DIR/2.raw | _filter_qemu_io +$QEMU_IO -c read -P 0x32 0 $size $TEST_DIR/3.raw | _filter_qemu_io + +echo +echo == corrupting image == + +$QEMU_IO -c write -P 0x42 0 $size $TEST_DIR/2.raw | _filter_qemu_io + +echo +echo == checking quorum correction == + +$QEMU_IO -c open -o $quorum -c read -P 0x32 0 $size | _filter_qemu_io + +echo +echo == breaking quorum == + +$QEMU_IO -c write -P 0x41 0 $size $TEST_DIR/1.raw | _filter_qemu_io +echo +echo == checking that quorum is broken == + +$QEMU_IO -c open -o $quorum -c read -P 0x32 0 $size | _filter_qemu_io + + +# success, all done +echo *** done +rm -f $seq.full +status=0 diff --git a/tests/qemu-iotests/081.out b/tests/qemu-iotests/081.out new file mode 100644 index 000..b5b55ab --- /dev/null +++ b/tests/qemu-iotests/081.out @@ -0,0 +1,34 @@ +QA output created by 081 + +== creating quorum files == +Formatting 'TEST_DIR/1.IMGFMT', fmt=IMGFMT size=10485760 +Formatting 'TEST_DIR/2.IMGFMT', fmt=IMGFMT size=10485760 +Formatting 'TEST_DIR/3.IMGFMT', fmt=IMGFMT size=10485760 + +== writing images == +wrote 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== checking quorum write == +read 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== corrupting image == +wrote 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== checking quorum correction == +read 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== breaking quorum == +wrote 10485760/10485760 bytes at offset 0 +10 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== checking that quorum is broken == +qemu-io: can't open device (null): Could not read image for determining its format: Input/output error +*** done diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group index d8be74a..fe5d764 100644 --- a/tests/qemu-iotests/group +++ b/tests/qemu-iotests/group @@ -83,3 +83,4 @@ 074 rw auto 077 rw auto 079 rw auto +081 rw auto -- 1.8.3.2
Re: [Qemu-devel] [PULL 00/20] acpi,pc,pci fixes and enhancements
Il 18/02/2014 13:16, Stefano Stabellini ha scritto: It looks like that this series breaks disk unplug (hw/ide/piix.c:pci_piix3_xen_ide_unplug). I bisected it and the problem is caused by: commit 5e95494380ecf83c97d28f72134ab45e0cace8f9 Author: Igor Mammedov imamm...@redhat.com Date: Wed Feb 5 16:36:52 2014 +0100 hw/pci: switch to a generic hotplug handling for PCIDevice make qdev_unplug()/device_set_realized() to call hotplug handler's plug/unplug methods if available and remove not needed anymore hot(un)plug handling from PCIDevice. In case if hotplug handler is not available, revert to the legacy hotplug method for compatibility with not yet converted buses. Signed-off-by: Igor Mammedov imamm...@redhat.com Reviewed-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com What exactly breaks? Paolo
Re: [Qemu-devel] [PATCH V17 11/12] quorum: Add quorum_open() and quorum_close().
The Tuesday 18 Feb 2014 à 12:06:57 (+0800), Fam Zheng wrote : On Thu, 02/13 10:09, Benoît Canet wrote: The Wednesday 12 Feb 2014 à 23:06:38 (+0100), Benoît Canet wrote : +static void quorum_close(BlockDriverState *bs) +{ +BDRVQuorumState *s = bs-opaque; +int i; + +for (i = 0; i s-num_children; i++) { +bdrv_unref(s-bs[i]); Quorum crash here from time to time I don't understand why. I think you could add printf or use gdb to examine every bdrv_unref on the crashing bs, so you can track down to the unbalanced reference. I found the cause of the crash since: I added a spurious QDECREF() in the external snapshot prepare function. It's fixed now. Thanks, Benoît Fam
[Qemu-devel] [PATCH] kvm: fix kvm_set_migration_log() behavior
The test (!!(mem-flags KVM_MEM_LOG_DIRTY_PAGES) == enable) is wrong because the condition is valid when enable = 0 and current dirty log memory flag is set. As a consequence kvm_log_global_stop() does not stop the KVM dirty log tracking: kvm_set_migration_log(0) didn't do its job. So instead I propose to use kvm_slot_dirty_pages_log_change() which correctly compare the memory flags (old/new). Signed-off-by: Vincent KHERBACHE vincent.kherba...@inria.fr --- kvm-all.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 2ca9143..f104f87 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -355,7 +355,7 @@ static int kvm_set_migration_log(int enable) { KVMState *s = kvm_state; KVMSlot *mem; -int i, err; +int i, err = 0; s-migration_log = enable; @@ -365,15 +365,9 @@ static int kvm_set_migration_log(int enable) if (!mem-memory_size) { continue; } -if (!!(mem-flags KVM_MEM_LOG_DIRTY_PAGES) == enable) { -continue; -} -err = kvm_set_user_memory_region(s, mem); -if (err) { -return err; -} +err = kvm_slot_dirty_pages_log_change(mem, (bool)enable); } -return 0; +return err; } /* get kvm's dirty pages bitmap and update qemu's */ -- 1.8.3.1
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 18 February 2014 12:17, Alexander Graf ag...@suse.de wrote: On 02/18/2014 12:22 PM, Peter Maydell wrote: My criteria for ARM in the past has typically been there's a new release every three months, anything that got past the release testing process for release N is sufficiently non-critical it can just go into release N+1. Unfortunately this doesn't work for distributions. Distros need to maintain a stable branch for the lifetime of a release to ensure that we're reasonably regression free. If you indicate that this doesn't apply to ARM it basically means you admit that ARM systems are not yet ready for stable use by customers when they want to use KVM. At least at the point when we agree that customers do want to run on a stable base for virtualization on ARM we need a working -stable system for critical fixes. I agree in general that ARM support needs to move from its traditional this is just a dev tool situation to a broader level of support/stability guarantees for KVM. (We're not yet guaranteeing cross-version migration, for another example there.) However again we run into the definition of what's a critical fix?. I think if distros need a stable branch then they need to be prepared to do the work of sorting through what counts as a critical fix that needs to be ported to that branch. (For instance, which boards and targets do they care about?) For instance patch 3 only applies to the integrator board, and I don't consider the guest-to-host border to be a security boundary for most of our legacy board models: there's just too much unaudited device code for that to be trustable. thanks -- PMM
[Qemu-devel] [PATCH 0/8] virtio endian-ambivalent target fixes (rebased)
On Fri, 14 Feb 2014 12:59:49 +0100 Andreas Färber afaer...@suse.de wrote: Hi, [...] It might've helped if Rusty had actually used our scripts/get_maintainer.pl script to CC people. While Anthony seems to have reviewed some patches (usually Reviewed-by should be before the final Signed-off-by fwiw), neither Stefan (virtio-net) nor Kevin (virtio-blk) nor Paolo (virtio-scsi) were CC'ed, and recently Michael stepped up as virtio maintainer, so maybe he can take them once ready. 1/7 looks okay to me; 3-7 are rather mechanical - people will need to review that those changes are sufficient for the current code base. We've since converted virtio devices to QOM realize, so a rebase is likely needed for such an old series. Regards, Andreas Hi, Thank you Andreas for your support. :) Here is a rebased patchset. Only the two first patches (virtio_get_byteswap: and virtio:) had to be updated actually. FWIW, this the very same patchset I use, along with some ppc64 specific enablement code, to have functionnal ppc64 LE guests. --- Greg Kurz (1): hw/9pfs/virtio_9p_device: use virtio wrappers to access headers. Rusty Russell (7): virtio_get_byteswap: function for endian-ambivalent targets using virtio. virtio: allow byte swapping for vring and config access hw/net/virtio-net: use virtio wrappers to access headers. hw/net/virtio-balloon: use virtio wrappers to access page frame numbers. hw/block/virtio-blk: use virtio wrappers to access headers. hw/scsi/virtio-scsi: use virtio wrappers to access headers. hw/char/virtio-serial-bus: use virtio wrappers to access headers. hw/9pfs/virtio-9p-device.c|3 + hw/block/virtio-blk.c | 35 +- hw/char/virtio-serial-bus.c | 34 +- hw/net/virtio-net.c | 15 ++-- hw/scsi/virtio-scsi.c | 33 + hw/virtio/virtio-balloon.c|3 + hw/virtio/virtio.c| 38 ++- include/hw/virtio/virtio-access.h | 132 + include/hw/virtio/virtio.h|2 + stubs/Makefile.objs |1 stubs/virtio_get_byteswap.c |6 ++ 11 files changed, 228 insertions(+), 74 deletions(-) create mode 100644 include/hw/virtio/virtio-access.h create mode 100644 stubs/virtio_get_byteswap.c Best Regards. -- Greg
[Qemu-devel] [PATCH 3/8] hw/net/virtio-net: use virtio wrappers to access headers.
From: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Rusty Russell ru...@rustcorp.com.au Reviewed-by: Anthony Liguori aligu...@us.ibm.com --- hw/net/virtio-net.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 3626608..34d6b48 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -23,6 +23,7 @@ #include hw/virtio/virtio-bus.h #include qapi/qmp/qjson.h #include monitor/monitor.h +#include hw/virtio/virtio-access.h #define VIRTIO_NET_VM_VERSION11 @@ -72,8 +73,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config) VirtIONet *n = VIRTIO_NET(vdev); struct virtio_net_config netcfg; -stw_p(netcfg.status, n-status); -stw_p(netcfg.max_virtqueue_pairs, n-max_queues); +virtio_stw_p(netcfg.status, n-status); +virtio_stw_p(netcfg.max_virtqueue_pairs, n-max_queues); memcpy(netcfg.mac, n-mac, ETH_ALEN); memcpy(config, netcfg, n-config_size); } @@ -618,7 +619,7 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd, s = iov_to_buf(iov, iov_cnt, 0, mac_data.entries, sizeof(mac_data.entries)); -mac_data.entries = ldl_p(mac_data.entries); +mac_data.entries = virtio_ldl_p(mac_data.entries); if (s != sizeof(mac_data.entries)) { goto error; } @@ -645,7 +646,7 @@ static int virtio_net_handle_mac(VirtIONet *n, uint8_t cmd, s = iov_to_buf(iov, iov_cnt, 0, mac_data.entries, sizeof(mac_data.entries)); -mac_data.entries = ldl_p(mac_data.entries); +mac_data.entries = virtio_ldl_p(mac_data.entries); if (s != sizeof(mac_data.entries)) { goto error; } @@ -690,7 +691,7 @@ static int virtio_net_handle_vlan_table(VirtIONet *n, uint8_t cmd, NetClientState *nc = qemu_get_queue(n-nic); s = iov_to_buf(iov, iov_cnt, 0, vid, sizeof(vid)); -vid = lduw_p(vid); +vid = virtio_lduw_p(vid); if (s != sizeof(vid)) { return VIRTIO_NET_ERR; } @@ -727,7 +728,7 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } -queues = lduw_p(mq.virtqueue_pairs); +queues = virtio_lduw_p(mq.virtqueue_pairs); if (queues VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN || queues VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX || @@ -1026,7 +1027,7 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t } if (mhdr_cnt) { -stw_p(mhdr.num_buffers, i); +virtio_stw_p(mhdr.num_buffers, i); iov_from_buf(mhdr_sg, mhdr_cnt, 0, mhdr.num_buffers, sizeof mhdr.num_buffers);
[Qemu-devel] [PATCH 6/8] hw/scsi/virtio-scsi: use virtio wrappers to access headers.
From: Rusty Russell ru...@rustcorp.com.au Note that st*_raw and ld*_raw are effectively replaced by st*_p and ld*_p. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Reviewed-by: Anthony Liguori aligu...@us.ibm.com --- hw/scsi/virtio-scsi.c | 33 + 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 6610b3a..df6c0e2 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -18,6 +18,7 @@ #include hw/scsi/scsi.h #include block/scsi.h #include hw/virtio/virtio-bus.h +#include hw/virtio/virtio-access.h typedef struct VirtIOSCSIReq { VirtIOSCSI *dev; @@ -313,12 +314,12 @@ static void virtio_scsi_command_complete(SCSIRequest *r, uint32_t status, req-resp.cmd-response = VIRTIO_SCSI_S_OK; req-resp.cmd-status = status; if (req-resp.cmd-status == GOOD) { -req-resp.cmd-resid = tswap32(resid); +req-resp.cmd-resid = virtio_tswap32(resid); } else { req-resp.cmd-resid = 0; sense_len = scsi_req_get_sense(r, req-resp.cmd-sense, VIRTIO_SCSI_SENSE_SIZE); -req-resp.cmd-sense_len = tswap32(sense_len); +req-resp.cmd-sense_len = virtio_tswap32(sense_len); } virtio_scsi_complete_req(req); } @@ -414,16 +415,16 @@ static void virtio_scsi_get_config(VirtIODevice *vdev, VirtIOSCSIConfig *scsiconf = (VirtIOSCSIConfig *)config; VirtIOSCSICommon *s = VIRTIO_SCSI_COMMON(vdev); -stl_raw(scsiconf-num_queues, s-conf.num_queues); -stl_raw(scsiconf-seg_max, 128 - 2); -stl_raw(scsiconf-max_sectors, s-conf.max_sectors); -stl_raw(scsiconf-cmd_per_lun, s-conf.cmd_per_lun); -stl_raw(scsiconf-event_info_size, sizeof(VirtIOSCSIEvent)); -stl_raw(scsiconf-sense_size, s-sense_size); -stl_raw(scsiconf-cdb_size, s-cdb_size); -stw_raw(scsiconf-max_channel, VIRTIO_SCSI_MAX_CHANNEL); -stw_raw(scsiconf-max_target, VIRTIO_SCSI_MAX_TARGET); -stl_raw(scsiconf-max_lun, VIRTIO_SCSI_MAX_LUN); +virtio_stl_p(scsiconf-num_queues, s-conf.num_queues); +virtio_stl_p(scsiconf-seg_max, 128 - 2); +virtio_stl_p(scsiconf-max_sectors, s-conf.max_sectors); +virtio_stl_p(scsiconf-cmd_per_lun, s-conf.cmd_per_lun); +virtio_stl_p(scsiconf-event_info_size, sizeof(VirtIOSCSIEvent)); +virtio_stl_p(scsiconf-sense_size, s-sense_size); +virtio_stl_p(scsiconf-cdb_size, s-cdb_size); +virtio_stw_p(scsiconf-max_channel, VIRTIO_SCSI_MAX_CHANNEL); +virtio_stw_p(scsiconf-max_target, VIRTIO_SCSI_MAX_TARGET); +virtio_stl_p(scsiconf-max_lun, VIRTIO_SCSI_MAX_LUN); } static void virtio_scsi_set_config(VirtIODevice *vdev, @@ -432,14 +433,14 @@ static void virtio_scsi_set_config(VirtIODevice *vdev, VirtIOSCSIConfig *scsiconf = (VirtIOSCSIConfig *)config; VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev); -if ((uint32_t) ldl_raw(scsiconf-sense_size) = 65536 || -(uint32_t) ldl_raw(scsiconf-cdb_size) = 256) { +if ((uint32_t) virtio_ldl_p(scsiconf-sense_size) = 65536 || +(uint32_t) virtio_ldl_p(scsiconf-cdb_size) = 256) { error_report(bad data written to virtio-scsi configuration space); exit(1); } -vs-sense_size = ldl_raw(scsiconf-sense_size); -vs-cdb_size = ldl_raw(scsiconf-cdb_size); +vs-sense_size = virtio_ldl_p(scsiconf-sense_size); +vs-cdb_size = virtio_ldl_p(scsiconf-cdb_size); } static uint32_t virtio_scsi_get_features(VirtIODevice *vdev,
[Qemu-devel] [PATCH 5/8] hw/block/virtio-blk: use virtio wrappers to access headers.
From: Rusty Russell ru...@rustcorp.com.au Note that st*_raw and ld*_raw are effectively replaced by st*_p and ld*_p. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Reviewed-by: Anthony Liguori aligu...@us.ibm.com --- hw/block/virtio-blk.c | 35 ++- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index 8a568e5..f55f89c 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -26,6 +26,7 @@ # include scsi/sg.h #endif #include hw/virtio/virtio-bus.h +#include hw/virtio/virtio-access.h typedef struct VirtIOBlockReq { @@ -77,7 +78,7 @@ static void virtio_blk_rw_complete(void *opaque, int ret) trace_virtio_blk_rw_complete(req, ret); if (ret) { -bool is_read = !(ldl_p(req-out-type) VIRTIO_BLK_T_OUT); +bool is_read = !(virtio_ldl_p(req-out-type) VIRTIO_BLK_T_OUT); if (virtio_blk_handle_rw_error(req, -ret, is_read)) return; } @@ -224,12 +225,12 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req) hdr.status = CHECK_CONDITION; } -stl_p(req-scsi-errors, - hdr.status | (hdr.msg_status 8) | - (hdr.host_status 16) | (hdr.driver_status 24)); -stl_p(req-scsi-residual, hdr.resid); -stl_p(req-scsi-sense_len, hdr.sb_len_wr); -stl_p(req-scsi-data_len, hdr.dxfer_len); +virtio_stl_p(req-scsi-errors, + hdr.status | (hdr.msg_status 8) | + (hdr.host_status 16) | (hdr.driver_status 24)); +virtio_stl_p(req-scsi-residual, hdr.resid); +virtio_stl_p(req-scsi-sense_len, hdr.sb_len_wr); +virtio_stl_p(req-scsi-data_len, hdr.dxfer_len); virtio_blk_req_complete(req, status); g_free(req); @@ -240,7 +241,7 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req) fail: /* Just put anything nonzero so that the ioctl fails in the guest. */ -stl_p(req-scsi-errors, 255); +virtio_stl_p(req-scsi-errors, 255); virtio_blk_req_complete(req, status); g_free(req); } @@ -286,7 +287,7 @@ static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb) BlockRequest *blkreq; uint64_t sector; -sector = ldq_p(req-out-sector); +sector = virtio_ldq_p(req-out-sector); bdrv_acct_start(req-dev-bs, req-acct, req-qiov.size, BDRV_ACCT_WRITE); @@ -320,7 +321,7 @@ static void virtio_blk_handle_read(VirtIOBlockReq *req) { uint64_t sector; -sector = ldq_p(req-out-sector); +sector = virtio_ldq_p(req-out-sector); bdrv_acct_start(req-dev-bs, req-acct, req-qiov.size, BDRV_ACCT_READ); @@ -358,7 +359,7 @@ static void virtio_blk_handle_request(VirtIOBlockReq *req, req-out = (void *)req-elem.out_sg[0].iov_base; req-in = (void *)req-elem.in_sg[req-elem.in_num - 1].iov_base; -type = ldl_p(req-out-type); +type = virtio_ldl_p(req-out-type); if (type VIRTIO_BLK_T_FLUSH) { virtio_blk_handle_flush(req, mrb); @@ -487,12 +488,12 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config) bdrv_get_geometry(s-bs, capacity); memset(blkcfg, 0, sizeof(blkcfg)); -stq_raw(blkcfg.capacity, capacity); -stl_raw(blkcfg.seg_max, 128 - 2); -stw_raw(blkcfg.cylinders, s-conf-cyls); -stl_raw(blkcfg.blk_size, blk_size); -stw_raw(blkcfg.min_io_size, s-conf-min_io_size / blk_size); -stw_raw(blkcfg.opt_io_size, s-conf-opt_io_size / blk_size); +virtio_stq_p(blkcfg.capacity, capacity); +virtio_stl_p(blkcfg.seg_max, 128 - 2); +virtio_stw_p(blkcfg.cylinders, s-conf-cyls); +virtio_stl_p(blkcfg.blk_size, blk_size); +virtio_stw_p(blkcfg.min_io_size, s-conf-min_io_size / blk_size); +virtio_stw_p(blkcfg.opt_io_size, s-conf-opt_io_size / blk_size); blkcfg.heads = s-conf-heads; /* * We must ensure that the block device capacity is a multiple of
[Qemu-devel] [PATCH 1/8] virtio_get_byteswap: function for endian-ambivalent targets using virtio.
From: Rusty Russell ru...@rustcorp.com.au virtio data structures are defined as target endian, which assumes that's a fixed value. In fact, that actually means it's platform-specific. The OASIS virtio 1.0 spec will fix this. Meanwhile, create a hook for little endian ppc (and potentially ARM). This is called at device reset time (which is done before any driver is loaded) since it may involve a system call to get the status when running under kvm. [ fixed checkpatch.pl error with the virtio_byteswap initialisation, ldq_phys() API change, Greg Kurz gk...@linux.vnet.ibm.com ] Signed-off-by: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com --- hw/virtio/virtio.c|6 ++ include/hw/virtio/virtio-access.h | 132 + include/hw/virtio/virtio.h|2 + stubs/Makefile.objs |1 stubs/virtio_get_byteswap.c |6 ++ 5 files changed, 147 insertions(+) create mode 100644 include/hw/virtio/virtio-access.h create mode 100644 stubs/virtio_get_byteswap.c diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index aeabf3a..4fd6ac2 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -19,6 +19,9 @@ #include hw/virtio/virtio.h #include qemu/atomic.h #include hw/virtio/virtio-bus.h +#include hw/virtio/virtio-access.h + +bool virtio_byteswap; /* * The alignment to use between consumer and producer parts of vring. @@ -546,6 +549,9 @@ void virtio_reset(void *opaque) virtio_set_status(vdev, 0); +/* We assume all devices are the same endian. */ +virtio_byteswap = virtio_get_byteswap(); + if (k-reset) { k-reset(vdev); } diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h new file mode 100644 index 000..2e22a47 --- /dev/null +++ b/include/hw/virtio/virtio-access.h @@ -0,0 +1,132 @@ +/* + * Virtio Accessor Support: In case your target can change endian. + * + * Copyright IBM, Corp. 2013 + * + * Authors: + * Rusty Russell ru...@au.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ +#ifndef _QEMU_VIRTIO_ACCESS_H +#define _QEMU_VIRTIO_ACCESS_H +#include hw/virtio/virtio.h + +/* Initialized by virtio_get_byteswap() at any virtio device reset. */ +extern bool virtio_byteswap; + +static inline uint16_t virtio_lduw_phys(AddressSpace *as, hwaddr pa) +{ +if (virtio_byteswap) { +return bswap16(lduw_phys(as, pa)); +} +return lduw_phys(as, pa); +} + +static inline uint32_t virtio_ldl_phys(AddressSpace *as, hwaddr pa) +{ +if (virtio_byteswap) { +return bswap32(ldl_phys(as, pa)); +} +return ldl_phys(as, pa); +} + +static inline uint64_t virtio_ldq_phys(AddressSpace *as, hwaddr pa) +{ +if (virtio_byteswap) { +return bswap64(ldq_phys(as, pa)); +} +return ldq_phys(as, pa); +} + +static inline void virtio_stw_phys(AddressSpace *as, hwaddr pa, uint16_t value) +{ +if (virtio_byteswap) { +stw_phys(as, pa, bswap16(value)); +} else { +stw_phys(as, pa, value); +} +} + +static inline void virtio_stl_phys(AddressSpace *as, hwaddr pa, uint32_t value) +{ +if (virtio_byteswap) { +stl_phys(as, pa, bswap32(value)); +} else { +stl_phys(as, pa, value); +} +} + +static inline void virtio_stw_p(void *ptr, uint16_t v) +{ +if (virtio_byteswap) { +stw_p(ptr, bswap16(v)); +} else { +stw_p(ptr, v); +} +} + +static inline void virtio_stl_p(void *ptr, uint32_t v) +{ +if (virtio_byteswap) { +stl_p(ptr, bswap32(v)); +} else { +stl_p(ptr, v); +} +} + +static inline void virtio_stq_p(void *ptr, uint64_t v) +{ +if (virtio_byteswap) { +stq_p(ptr, bswap64(v)); +} else { +stq_p(ptr, v); +} +} + +static inline int virtio_lduw_p(const void *ptr) +{ +if (virtio_byteswap) { +return bswap16(lduw_p(ptr)); +} else { +return lduw_p(ptr); +} +} + +static inline int virtio_ldl_p(const void *ptr) +{ +if (virtio_byteswap) { +return bswap32(ldl_p(ptr)); +} else { +return ldl_p(ptr); +} +} + +static inline uint64_t virtio_ldq_p(const void *ptr) +{ +if (virtio_byteswap) { +return bswap64(ldq_p(ptr)); +} else { +return ldq_p(ptr); +} +} + +static inline uint32_t virtio_tswap32(uint32_t s) +{ +if (virtio_byteswap) { +return bswap32(tswap32(s)); +} else { +return tswap32(s); +} +} + +static inline void virtio_tswap32s(uint32_t *s) +{ +tswap32s(s); +if (virtio_byteswap) { +*s = bswap32(*s); +} +} +#endif /* _QEMU_VIRTIO_ACCESS_H */ diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index 3e54e90..5009945 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -253,4 +253,6 @@ void
Re: [Qemu-devel] [PATCH] block/iscsi: fix deadlock on scsi check condition
Il 18/02/2014 13:08, Peter Lieven ha scritto: the retry logic was broken because the complete status of the task structure was not reset. this resulted in an infinite loop retrying the command over and over. CC: qemu-sta...@nongnu.org Signed-off-by: Peter Lieven p...@kamp.de --- block/iscsi.c |9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/block/iscsi.c b/block/iscsi.c index 3c0b728..ded414e 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -145,12 +145,13 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status, if (iTask-retries-- 0 status == SCSI_STATUS_CHECK_CONDITION task-sense.key == SCSI_SENSE_UNIT_ATTENTION) { +error_report(iSCSI CheckCondition: %s, iscsi_get_error(iscsi)); iTask-do_retry = 1; goto out; } if (status != SCSI_STATUS_GOOD) { -error_report(iSCSI: Failure. %s, iscsi_get_error(iscsi)); +error_report(iSCSI Failure: %s, iscsi_get_error(iscsi)); } out: @@ -325,6 +326,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -399,6 +401,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -433,6 +436,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -683,6 +687,7 @@ retry: scsi_free_scsi_task(iTask.task); iTask.task = NULL; } +iTask.complete = 0; goto retry; } @@ -767,6 +772,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } @@ -836,6 +842,7 @@ retry: } if (iTask.do_retry) { +iTask.complete = 0; goto retry; } Applied to scsi-next, thanks. Paolo
[Qemu-devel] [PATCH 2/8] virtio: allow byte swapping for vring and config access
From: Rusty Russell ru...@rustcorp.com.au This is based on a simpler patch by Anthony Liguouri, which only handled the vring accesses. We also need some drivers to access these helpers, eg. for data which contains headers. [ ldq_phys() API change, Greg Kurz gk...@linux.vnet.ibm.com ] Signed-off-by: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com --- hw/virtio/virtio.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 4fd6ac2..c2c3a7c 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -108,49 +108,49 @@ static inline uint64_t vring_desc_addr(hwaddr desc_pa, int i) { hwaddr pa; pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr); -return ldq_phys(address_space_memory, pa); +return virtio_ldq_phys(address_space_memory, pa); } static inline uint32_t vring_desc_len(hwaddr desc_pa, int i) { hwaddr pa; pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len); -return ldl_phys(address_space_memory, pa); +return virtio_ldl_phys(address_space_memory, pa); } static inline uint16_t vring_desc_flags(hwaddr desc_pa, int i) { hwaddr pa; pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags); -return lduw_phys(address_space_memory, pa); +return virtio_lduw_phys(address_space_memory, pa); } static inline uint16_t vring_desc_next(hwaddr desc_pa, int i) { hwaddr pa; pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next); -return lduw_phys(address_space_memory, pa); +return virtio_lduw_phys(address_space_memory, pa); } static inline uint16_t vring_avail_flags(VirtQueue *vq) { hwaddr pa; pa = vq-vring.avail + offsetof(VRingAvail, flags); -return lduw_phys(address_space_memory, pa); +return virtio_lduw_phys(address_space_memory, pa); } static inline uint16_t vring_avail_idx(VirtQueue *vq) { hwaddr pa; pa = vq-vring.avail + offsetof(VRingAvail, idx); -return lduw_phys(address_space_memory, pa); +return virtio_lduw_phys(address_space_memory, pa); } static inline uint16_t vring_avail_ring(VirtQueue *vq, int i) { hwaddr pa; pa = vq-vring.avail + offsetof(VRingAvail, ring[i]); -return lduw_phys(address_space_memory, pa); +return virtio_lduw_phys(address_space_memory, pa); } static inline uint16_t vring_used_event(VirtQueue *vq) @@ -162,44 +162,44 @@ static inline void vring_used_ring_id(VirtQueue *vq, int i, uint32_t val) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, ring[i].id); -stl_phys(address_space_memory, pa, val); +virtio_stl_phys(address_space_memory, pa, val); } static inline void vring_used_ring_len(VirtQueue *vq, int i, uint32_t val) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, ring[i].len); -stl_phys(address_space_memory, pa, val); +virtio_stl_phys(address_space_memory, pa, val); } static uint16_t vring_used_idx(VirtQueue *vq) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, idx); -return lduw_phys(address_space_memory, pa); +return virtio_lduw_phys(address_space_memory, pa); } static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, idx); -stw_phys(address_space_memory, pa, val); +virtio_stw_phys(address_space_memory, pa, val); } static inline void vring_used_flags_set_bit(VirtQueue *vq, int mask) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, flags); -stw_phys(address_space_memory, - pa, lduw_phys(address_space_memory, pa) | mask); +virtio_stw_phys(address_space_memory, + pa, virtio_lduw_phys(address_space_memory, pa) | mask); } static inline void vring_used_flags_unset_bit(VirtQueue *vq, int mask) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, flags); -stw_phys(address_space_memory, - pa, lduw_phys(address_space_memory, pa) ~mask); +virtio_stw_phys(address_space_memory, +pa, lduw_phys(address_space_memory, pa) ~mask); } static inline void vring_avail_event(VirtQueue *vq, uint16_t val) @@ -209,7 +209,7 @@ static inline void vring_avail_event(VirtQueue *vq, uint16_t val) return; } pa = vq-vring.used + offsetof(VRingUsed, ring[vq-vring.num]); -stw_phys(address_space_memory, pa, val); +virtio_stw_phys(address_space_memory, pa, val); } void virtio_queue_set_notification(VirtQueue *vq, int enable)
[Qemu-devel] [PATCH V18 05/12] quorum: Add quorum_aio_readv.
From: Benoît Canet ben...@irqsave.net Add code to do num_children reads in parallel and cleanup the structures afterwards. Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/quorum.c | 39 ++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/block/quorum.c b/block/quorum.c index 99e0e03..fd19662 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -91,10 +91,18 @@ static AIOCBInfo quorum_aiocb_info = { static void quorum_aio_finalize(QuorumAIOCB *acb) { -int ret = 0; +BDRVQuorumState *s = acb-common.bs-opaque; +int i, ret = 0; acb-common.cb(acb-common.opaque, ret); +if (acb-is_read) { +for (i = 0; i s-num_children; i++) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} +} + g_free(acb-qcrs); qemu_aio_release(acb); } @@ -149,6 +157,34 @@ static void quorum_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs, + int64_t sector_num, + QEMUIOVector *qiov, + int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque) +{ +BDRVQuorumState *s = bs-opaque; +QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, + nb_sectors, cb, opaque); +int i; + +acb-is_read = true; + +for (i = 0; i s-num_children; i++) { +acb-qcrs[i].buf = qemu_blockalign(s-bs[i], qiov-size); +qemu_iovec_init(acb-qcrs[i].qiov, qiov-niov); +qemu_iovec_clone(acb-qcrs[i].qiov, qiov, acb-qcrs[i].buf); +} + +for (i = 0; i s-num_children; i++) { +bdrv_aio_readv(s-bs[i], sector_num, qiov, nb_sectors, + quorum_aio_cb, acb-qcrs[i]); +} + +return acb-common; +} + static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov, @@ -176,6 +212,7 @@ static BlockDriver bdrv_quorum = { .instance_size = sizeof(BDRVQuorumState), +.bdrv_aio_readv = quorum_aio_readv, .bdrv_aio_writev= quorum_aio_writev, }; -- 1.8.3.2
[Qemu-devel] [PATCH 4/8] hw/net/virtio-balloon: use virtio wrappers to access page frame numbers.
From: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Rusty Russell ru...@rustcorp.com.au Reviewed-by: Anthony Liguori aligu...@us.ibm.com --- hw/virtio/virtio-balloon.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c index a470a0b..5e7f26f 100644 --- a/hw/virtio/virtio-balloon.c +++ b/hw/virtio/virtio-balloon.c @@ -30,6 +30,7 @@ #endif #include hw/virtio/virtio-bus.h +#include hw/virtio/virtio-access.h static void balloon_page(void *addr, int deflate) { @@ -192,7 +193,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq) ram_addr_t pa; ram_addr_t addr; -pa = (ram_addr_t)ldl_p(pfn) VIRTIO_BALLOON_PFN_SHIFT; +pa = (ram_addr_t)virtio_ldl_p(pfn) VIRTIO_BALLOON_PFN_SHIFT; offset += 4; /* FIXME: remove get_system_memory(), but how? */
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
Am 18.02.2014 12:16, schrieb Paolo Bonzini: Il 18/02/2014 12:09, Peter Maydell ha scritto: No, we've had that topic before: It's your job as submitter and maintainer to flag that appropriately in the commit message, as per QEMU Summit 2012. I don't think this workflow works. I have no idea what stable's criteria are, and if you rely on people adding a cc you're going to miss stuff. There isn't really a standard criterion. It's up to each maintainer to be stricter or looser on what goes to stable. The criteria is pretty simple: Was the breakage in the last release already or was it introduced only intermittently. It simply does not scale to have Michael or other stable maintainers (mjt, me, ...) look at each commit from vX.Y to HEAD and decide whether to backport or not. That's why that task of flagging as backport *candidates* is pushed out to maintainers and recursively to authors where applicable, to reduce the number of commits to sift through and to allow to do this for actually committed patches rather than mails on the mailing list that might not get committed or change subject. What especially annoys me here is that Peter wants to play on Anthony's level on the project but is openly ignoring both our stable releases as a concept (we wouldn't need a release in the first place if we don't care about it working!) and the procedures decided in his presence at QEMU Summit (having maintainer/contributor flag it via Cc: line). If you feel the conclusion we reached there is not working out, feel free to bring this topic up on the KVM call later today - playing Rumpelstilzchen and exempting you from what everyone else is doing is not an acceptable solution. Either we all do it this way or we all decide on another way. It was not my suggestion, just a proposed solution to an issue that affects me, so I'm open to alternatives. Regards, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
[Qemu-devel] [PATCH 7/8] hw/char/virtio-serial-bus: use virtio wrappers to access headers.
From: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Rusty Russell ru...@rustcorp.com.au Reviewed-by: Anthony Liguori aligu...@us.ibm.com --- hw/char/virtio-serial-bus.c | 34 ++ 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c index 226e9f9..243bf31 100644 --- a/hw/char/virtio-serial-bus.c +++ b/hw/char/virtio-serial-bus.c @@ -24,6 +24,7 @@ #include hw/sysbus.h #include trace.h #include hw/virtio/virtio-serial.h +#include hw/virtio/virtio-access.h static VirtIOSerialPort *find_port_by_id(VirtIOSerial *vser, uint32_t id) { @@ -185,9 +186,9 @@ static size_t send_control_event(VirtIOSerial *vser, uint32_t port_id, { struct virtio_console_control cpkt; -stl_p(cpkt.id, port_id); -stw_p(cpkt.event, event); -stw_p(cpkt.value, value); +virtio_stl_p(cpkt.id, port_id); +virtio_stw_p(cpkt.event, event); +virtio_stw_p(cpkt.value, value); trace_virtio_serial_send_control_event(port_id, event, value); return send_control_msg(vser, cpkt, sizeof(cpkt)); @@ -291,8 +292,8 @@ static void handle_control_message(VirtIOSerial *vser, void *buf, size_t len) return; } -cpkt.event = lduw_p(gcpkt-event); -cpkt.value = lduw_p(gcpkt-value); +cpkt.event = virtio_lduw_p(gcpkt-event); +cpkt.value = virtio_lduw_p(gcpkt-value); trace_virtio_serial_handle_control_message(cpkt.event, cpkt.value); @@ -312,10 +313,10 @@ static void handle_control_message(VirtIOSerial *vser, void *buf, size_t len) return; } -port = find_port_by_id(vser, ldl_p(gcpkt-id)); +port = find_port_by_id(vser, virtio_ldl_p(gcpkt-id)); if (!port) { error_report(virtio-serial-bus: Unexpected port id %u for device %s, - ldl_p(gcpkt-id), vser-bus.qbus.name); + virtio_ldl_p(gcpkt-id), vser-bus.qbus.name); return; } @@ -342,9 +343,9 @@ static void handle_control_message(VirtIOSerial *vser, void *buf, size_t len) } if (port-name) { -stl_p(cpkt.id, port-id); -stw_p(cpkt.event, VIRTIO_CONSOLE_PORT_NAME); -stw_p(cpkt.value, 1); +virtio_stl_p(cpkt.id, port-id); +virtio_stw_p(cpkt.event, VIRTIO_CONSOLE_PORT_NAME); +virtio_stw_p(cpkt.value, 1); buffer_len = sizeof(cpkt) + strlen(port-name) + 1; buffer = g_malloc(buffer_len); @@ -536,7 +537,7 @@ static void virtio_serial_save(QEMUFile *f, void *opaque) qemu_put_be32s(f, s-config.max_nr_ports); /* The ports map */ -max_nr_ports = tswap32(s-config.max_nr_ports); +max_nr_ports = virtio_tswap32(s-config.max_nr_ports); for (i = 0; i (max_nr_ports + 31) / 32; i++) { qemu_put_be32s(f, s-ports_map[i]); } @@ -690,8 +691,8 @@ static int virtio_serial_load(QEMUFile *f, void *opaque, int version_id) qemu_get_be16s(f, s-config.rows); qemu_get_be32s(f, max_nr_ports); -tswap32s(max_nr_ports); -if (max_nr_ports tswap32(s-config.max_nr_ports)) { +virtio_tswap32s(max_nr_ports); +if (max_nr_ports virtio_tswap32(s-config.max_nr_ports)) { /* Source could have had more ports than us. Fail migration. */ return -EINVAL; } @@ -760,7 +761,7 @@ static uint32_t find_free_port_id(VirtIOSerial *vser) { unsigned int i, max_nr_ports; -max_nr_ports = tswap32(vser-config.max_nr_ports); +max_nr_ports = virtio_tswap32(vser-config.max_nr_ports); for (i = 0; i (max_nr_ports + 31) / 32; i++) { uint32_t map, bit; @@ -846,7 +847,7 @@ static int virtser_port_qdev_init(DeviceState *qdev) } } -max_nr_ports = tswap32(port-vser-config.max_nr_ports); +max_nr_ports = virtio_tswap32(port-vser-config.max_nr_ports); if (port-id = max_nr_ports) { error_report(virtio-serial-bus: Out-of-range port id specified, max. allowed: %u, max_nr_ports - 1); @@ -949,7 +950,8 @@ static void virtio_serial_device_realize(DeviceState *dev, Error **errp) vser-ovqs[i] = virtio_add_queue(vdev, 128, handle_output); } -vser-config.max_nr_ports = tswap32(vser-serial.max_virtserial_ports); +vser-config.max_nr_ports = +virtio_tswap32(vser-serial.max_virtserial_ports); vser-ports_map = g_malloc0(((vser-serial.max_virtserial_ports + 31) / 32) * sizeof(vser-ports_map[0])); /*
[Qemu-devel] [PATCH 8/8] hw/9pfs/virtio_9p_device: use virtio wrappers to access headers.
Note that st*_raw and ld*_raw are effectively replaced by st*_p and ld*_p. Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com --- hw/9pfs/virtio-9p-device.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c index 15a4983..b3161ec 100644 --- a/hw/9pfs/virtio-9p-device.c +++ b/hw/9pfs/virtio-9p-device.c @@ -19,6 +19,7 @@ #include fsdev/qemu-fsdev.h #include virtio-9p-xattr.h #include virtio-9p-coth.h +#include hw/virtio/virtio-access.h static uint32_t virtio_9p_get_features(VirtIODevice *vdev, uint32_t features) { @@ -34,7 +35,7 @@ static void virtio_9p_get_config(VirtIODevice *vdev, uint8_t *config) len = strlen(s-tag); cfg = g_malloc0(sizeof(struct virtio_9p_config) + len); -stw_raw(cfg-tag_len, len); +virtio_stw_p(cfg-tag_len, len); /* We don't copy the terminating null to config space */ memcpy(cfg-tag, s-tag, len); memcpy(config, cfg, s-config_size);
Re: [Qemu-devel] [PULL 00/20] acpi,pc,pci fixes and enhancements
On Tue, 18 Feb 2014, Paolo Bonzini wrote: Il 18/02/2014 13:16, Stefano Stabellini ha scritto: It looks like that this series breaks disk unplug (hw/ide/piix.c:pci_piix3_xen_ide_unplug). I bisected it and the problem is caused by: commit 5e95494380ecf83c97d28f72134ab45e0cace8f9 Author: Igor Mammedov imamm...@redhat.com Date: Wed Feb 5 16:36:52 2014 +0100 hw/pci: switch to a generic hotplug handling for PCIDevice make qdev_unplug()/device_set_realized() to call hotplug handler's plug/unplug methods if available and remove not needed anymore hot(un)plug handling from PCIDevice. In case if hotplug handler is not available, revert to the legacy hotplug method for compatibility with not yet converted buses. Signed-off-by: Igor Mammedov imamm...@redhat.com Reviewed-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com What exactly breaks? Disk unplug: hw/ide/piix.c:pci_piix3_xen_ide_unplug (see the beginning of the email :-P). It is called by hw/xen/xen_platform.c:platform_fixed_ioport_writew, in response to the guest writing to a magic ioport specifically to unplug the emulated disk. With this patch after the guest boots I can still access both xvda and sda for the same disk, leading to fs corruptions.
Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
* mrhi...@linux.vnet.ibm.com (mrhi...@linux.vnet.ibm.com) wrote: From: Michael R. Hines mrhi...@us.ibm.com Wiki: http://wiki.qemu.org/Features/MicroCheckpointing Github: g...@github.com:hinesmr/qemu.git, 'mc' branch NOTE: This is a direct copy of the QEMU wiki page for the convenience of the review process. Since this series very much in flux, instead of maintaing two copies of documentation in two different formats, this documentation will be properly formatted in the future when the review process has completed. It seems to be picking up some truncations as well. +The Micro-Checkpointing Process +Basic Algorithm +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a live migration that never ends. As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps: + +1. After N milliseconds, stop the VM. +3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU. +4. Resume the VM immediately so that it can make forward progress. +5. Transmit the checkpoint to the destination. +6. Repeat +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally. Later you talk about the memory allocation and how you grow the memory as needed to fit the checkpoint, have you tried going the other way and triggering the checkpoints sooner if they're taking too much memory? +1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This happens very early in the loss of the latest MC not only because a very large amount of bytes is typically being sequenced in a TCP stream but perhaps also because of the timeout in acknowledgement of the receipt of a commit message by the destination. + +2. MC over RDMA: Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed. + +In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed. + +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint. + +If the destination is deemed to be lost, we perform the same action as a live migration: resume the sender normally and wait for management software to make a policy decision about whether or not to re-protect the VM, which may involve a third-party to identify a new destination host again to use as a backup for the VM. In this world what is making the decision about whether the sender/destination should win - how do you avoid a split brain situation where both VMs are running but the only thing that failed is the comms between them? Is there any guarantee that you'll have received knowledge of the comms failure before you pull the plug out and enable the corked packets to be sent on the sender side? snip +RDMA is used for two different reasons: + +1. Checkpoint generation (RDMA-based memcpy): +2. Checkpoint transmission +Checkpoint generation must be done while the VM is paused. In the worst case, the size of the checkpoint can be equal in size to the amount of memory in total use by the VM. In order to resume VM execution as fast as possible, the checkpoint is copied consistently locally into a staging area before transmission. A standard memcpy() of potentially such a large amount of memory not only gets no use out of the CPU cache but also potentially clogs up the CPU pipeline which would otherwise be useful by other neighbor VMs on the same physical node that could be scheduled for execution. To minimize the effect on neighbor VMs, we use RDMA to perform a local memcpy(), bypassing the host processor. On more recent processors, a 'beefy' enough memory bus architecture can move memory just as fast (sometimes faster) as a pure-software CPU-only optimized memcpy() from libc. However, on older computers, this feature only gives you the benefit of lower CPU-utilization at the expense of Isn't there a generic kernel DMA ABI for doing this type of thing (I think there was at one point, people have suggested things like using graphics cards to do it but I don't know if it ever happened). The other question is, do you always need to copy - what about something like COWing the pages? Dave -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] Fwd: Trying to write a new device / virtio-i2c ?
2014-02-17 17:11 GMT+01:00 Paolo Bonzini pbonz...@redhat.com: Il 17/02/2014 16:33, Alex David ha scritto: If you need more than one bus, you need a new device exposing the I2C bus, besides the new sensor devices. USB-I2C could be one such device. So let me see if I understood well. USB-I2C (host QEMU device) seems a good idea, I could normally do : qemu-system-i386 -device usb-I2c,chardev=foo -device usb-i2c,chardev=bar -chardev socket,path=/tmp/test0,server,nowait,id=foo -chardev socket,path=/tmp/test1,server,nowait,id=bar. Almost. For QOM: -device usb-i2c,id=usb-i2c-0 -device i2c-my-sensor,address=0x48,bus=usb-i2c-0.0 For chardev: -device usb-i2c,id=usb-i2c-0 -chardev socket,path=/tmp/test0,server,nowait,id=chr-foo-0 -device i2c-my-sensor,address=0x48,bus=usb-i2c-0.0,chardev=chr-foo-0 Repeat for the other buses, replacing -0 with -1 and -2. I need a USB-I2C guest kernel driver that would register a bus (i2c-1 for chardev foo, i2c-2 for chardev bar etc...), I guess ? It exists already, drivers/i2c/busses/i2c-tiny-usb.c. I'm now trying to write this new USB-I2C device, I look through some of the code (dev-serial, dev-...), I used dev-serial code and removed the code that wanted to declare a chardev (I don't need one to create my bus, right ?). Now I'm wondering, how will be i2c-tiny-usb launched and declare a new /dev/i2c-N, with my device. I found : static struct usb_device_id i2c_tiny_usb_table [] = { { USB_DEVICE(0x0403, 0xc631) }, /* FTDI */ { USB_DEVICE(0x1c40, 0x0534) }, /* EZPrototypes */ { } /* Terminating entry */ }; So I figured my USB-I2C should register using these numbers ? static const USBDesc desc_i2c = { .id = { .idVendor = 0x0403, .idProduct = 0xc631, .bcdDevice = 0x0400, .iManufacturer = STR_MANUFACTURER, .iProduct = STR_PRODUCT_I2C, .iSerialNumber = STR_SERIALNUMBER, }, .full = desc_device, .str = desc_strings, }; However, this doesn't work, after a $ qemu-system-i386 debian_wheezy_i386_standard.qcow2 -usb -device USB-I2C,id=usb-i2c-0, the driver doesn't seem to be used (/sys/class/i2c-dev/i2c-0/name isn't for it...), even if I modprobe i2c-tiny-usb... The dmesg shows it doesn't go into the probe function. What did I miss ?
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 02/18/2014 01:37 PM, Peter Maydell wrote: On 18 February 2014 12:17, Alexander Graf ag...@suse.de wrote: On 02/18/2014 12:22 PM, Peter Maydell wrote: My criteria for ARM in the past has typically been there's a new release every three months, anything that got past the release testing process for release N is sufficiently non-critical it can just go into release N+1. Unfortunately this doesn't work for distributions. Distros need to maintain a stable branch for the lifetime of a release to ensure that we're reasonably regression free. If you indicate that this doesn't apply to ARM it basically means you admit that ARM systems are not yet ready for stable use by customers when they want to use KVM. At least at the point when we agree that customers do want to run on a stable base for virtualization on ARM we need a working -stable system for critical fixes. I agree in general that ARM support needs to move from its traditional this is just a dev tool situation to a broader level of support/stability guarantees for KVM. (We're not yet guaranteeing cross-version migration, for another example there.) Yup. I think it's reasonably to not declare ARM a stable target at the current point in time. But I think we agree that we want to change that - timeframe wise probably around the release after 2.0 at which point hopefully PCI and virtio 1.0 have settled. However again we run into the definition of what's a critical fix?. I think if distros need a stable branch then they need to be prepared to do the work of sorting through what counts as a critical fix that needs to be ported to that branch. (For instance, which boards and targets do they care about?) I think this is up for discussion. If I had to declare anything, I wouldn't consider anything but the virt machine as supported for example - similar to how x86 only really considers the pc machine type stable. For instance patch 3 only applies to the integrator board, and I don't consider the guest-to-host border to be a security boundary for most of our legacy board models: there's just too much unaudited device code for that to be trustable. Yes, I fully agree. Traditionally what I've done is to reply to patches that I consider stable material and nag the maintainer about CCing it. After a while people got so afraid of my emails that they started doing the CC themselves :). But in case of the integrator board I personally wouldn't bother ;). Alex
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
Am 18.02.2014 13:51, schrieb Alexander Graf: Traditionally what I've done is to reply to patches that I consider stable material and nag the maintainer about CCing it. After a while people got so afraid of my emails that they started doing the CC themselves :). But in case of the integrator board I personally wouldn't bother ;). Well, I have been doing exactly that here, asking Peter to add the CC, and still Peter is refusing to add that single line before applying the patch to his branch... :( Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 02/18/2014 01:53 PM, Andreas Färber wrote: Am 18.02.2014 13:51, schrieb Alexander Graf: Traditionally what I've done is to reply to patches that I consider stable material and nag the maintainer about CCing it. After a while people got so afraid of my emails that they started doing the CC themselves :). But in case of the integrator board I personally wouldn't bother ;). Well, I have been doing exactly that here, asking Peter to add the CC, and still Peter is refusing to add that single line before applying the patch to his branch... :( Peter, can you please put CC stable on them? Let's consider this a training exercise on how to add a stable tag and call it a day ;). Alex
Re: [Qemu-devel] [PATCH v4] qdev: Keep global allocation counter per bus
Am 18.02.2014 13:54, schrieb Markus Armbruster: Peter, can you merge this patch? v2 is from December 20, and the changes since then have been limited to the commit message. In case Peter doesn't want to take it directly: Andreas, would you be willing to take it through your tree? It's not really QOM, but your tree is the closest fit I can find. Last thing I read was some open discussions related to migration problems, so I didn't take it yet. If those are resolved, as a qdev patch I would take it through qom-next, if mst acks for the PC part. Regards, Andreas Markus Armbruster arm...@redhat.com writes: From: Alexander Graf ag...@suse.de When we have 2 separate qdev devices that both create a qbus of the same type without specifying a bus name or device name, we end up with two buses of the same name, such as ide.0 on the Mac machines: dev: macio-ide, id bus: ide.0 type IDE dev: macio-ide, id bus: ide.0 type IDE If we now spawn a device that connects to a ide.0 the last created bus gets the device, with the first created bus inaccessible to the command line. After some discussion on IRC we concluded that the best quick fix way forward for this is to make automated bus-class type based allocation count a global counter. That's what this patch implements. With this we instead get dev: macio-ide, id bus: ide.1 type IDE dev: macio-ide, id bus: ide.0 type IDE on the example mentioned above. This also means that if you did -device ...,bus=ide.0 you got a device on the first bus (the last created one) before this patch and get that device on the second one (the first created one) now. Breaks migration unless you change bus=ide.0 to bus=ide.1 on the destination. This is intended and makes the bus enumeration work as expected. As per review request follows a list of otherwise affected boards and the reasoning for the conclusion that they are ok: target machine bus id times -- --- -- - aarch64 n800i2c-bus.0 2 aarch64 n810i2c-bus.0 2 arm n800i2c-bus.0 2 arm n810i2c-bus.0 2 - Devices are only created explicitly on one of the two buses, using s-mpu-i2c[0], so no change to the guest. aarch64 vexpress-a15virtio-mmio-bus.0 4 aarch64 vexpress-a9 virtio-mmio-bus.0 4 aarch64 virtvirtio-mmio-bus.0 32 arm vexpress-a15virtio-mmio-bus.0 4 arm vexpress-a9 virtio-mmio-bus.0 4 arm virtvirtio-mmio-bus.0 32 - Makes -device bus= work for all virtio-mmio buses. Breaks migration. Workaround for migration from old to new: specify virtio-mmio-bus.4 or .32 respectively rather than .0 on the destination. aarch64 xilinx-zynq-a9 usb-bus.0 2 arm xilinx-zynq-a9 usb-bus.0 2 mips64elfulong2eusb-bus.0 2 - Normal USB operation not affected. Migration driver needs command line to use the other bus. i386isapc ide.0 2 x86_64 isapc ide.0 2 mipsmipside.0 2 mips64 mipside.0 2 mips64elmipside.0 2 mipsel mipside.0 2 ppc g3beige ide.0 2 ppc mac99 ide.0 2 ppc prepide.0 2 ppc64 g3beige ide.0 2 ppc64 mac99 ide.0 2 ppc64 prepide.0 2 - Makes -device bus= work for all IDE buses. Breaks migration. Workaround for migration from old to new: specify ide.1 rather than ide.0 on the destination. CC: Paolo Bonzini pbonz...@redhat.com CC: Anthony Liguori aligu...@amazon.com Signed-off-by: Alexander Graf ag...@suse.de Signed-off-by: Markus Armbruster arm...@redhat.com --- v1 - v2: - add fix for isapc which was searching for 2 buses called ide.0 - explain the semantic change more in the commit message v2 - v3: - add board list to commit message v3 - v4: - Explain impact on migration more clearly in commit message. hw/core/qdev.c | 20 +--- hw/i386/pc_piix.c | 8 +++- include/hw/qdev-core.h | 2 ++ 3 files changed, 22 insertions(+), 8 deletions(-) diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 82a9123..e7985fe 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -430,27 +430,33 @@ DeviceState *qdev_find_recursive(BusState *bus, const char *id) static void qbus_realize(BusState *bus, DeviceState *parent, const char *name) { const char *typename =
Re: [Qemu-devel] [PATCH v4] qdev: Keep global allocation counter per bus
Peter, can you merge this patch? v2 is from December 20, and the changes since then have been limited to the commit message. In case Peter doesn't want to take it directly: Andreas, would you be willing to take it through your tree? It's not really QOM, but your tree is the closest fit I can find. Markus Armbruster arm...@redhat.com writes: From: Alexander Graf ag...@suse.de When we have 2 separate qdev devices that both create a qbus of the same type without specifying a bus name or device name, we end up with two buses of the same name, such as ide.0 on the Mac machines: dev: macio-ide, id bus: ide.0 type IDE dev: macio-ide, id bus: ide.0 type IDE If we now spawn a device that connects to a ide.0 the last created bus gets the device, with the first created bus inaccessible to the command line. After some discussion on IRC we concluded that the best quick fix way forward for this is to make automated bus-class type based allocation count a global counter. That's what this patch implements. With this we instead get dev: macio-ide, id bus: ide.1 type IDE dev: macio-ide, id bus: ide.0 type IDE on the example mentioned above. This also means that if you did -device ...,bus=ide.0 you got a device on the first bus (the last created one) before this patch and get that device on the second one (the first created one) now. Breaks migration unless you change bus=ide.0 to bus=ide.1 on the destination. This is intended and makes the bus enumeration work as expected. As per review request follows a list of otherwise affected boards and the reasoning for the conclusion that they are ok: target machine bus id times -- --- -- - aarch64 n800i2c-bus.0 2 aarch64 n810i2c-bus.0 2 arm n800i2c-bus.0 2 arm n810i2c-bus.0 2 - Devices are only created explicitly on one of the two buses, using s-mpu-i2c[0], so no change to the guest. aarch64 vexpress-a15virtio-mmio-bus.0 4 aarch64 vexpress-a9 virtio-mmio-bus.0 4 aarch64 virtvirtio-mmio-bus.0 32 arm vexpress-a15virtio-mmio-bus.0 4 arm vexpress-a9 virtio-mmio-bus.0 4 arm virtvirtio-mmio-bus.0 32 - Makes -device bus= work for all virtio-mmio buses. Breaks migration. Workaround for migration from old to new: specify virtio-mmio-bus.4 or .32 respectively rather than .0 on the destination. aarch64 xilinx-zynq-a9 usb-bus.0 2 arm xilinx-zynq-a9 usb-bus.0 2 mips64elfulong2eusb-bus.0 2 - Normal USB operation not affected. Migration driver needs command line to use the other bus. i386isapc ide.0 2 x86_64 isapc ide.0 2 mipsmipside.0 2 mips64 mipside.0 2 mips64elmipside.0 2 mipsel mipside.0 2 ppc g3beige ide.0 2 ppc mac99 ide.0 2 ppc prepide.0 2 ppc64 g3beige ide.0 2 ppc64 mac99 ide.0 2 ppc64 prepide.0 2 - Makes -device bus= work for all IDE buses. Breaks migration. Workaround for migration from old to new: specify ide.1 rather than ide.0 on the destination. CC: Paolo Bonzini pbonz...@redhat.com CC: Anthony Liguori aligu...@amazon.com Signed-off-by: Alexander Graf ag...@suse.de Signed-off-by: Markus Armbruster arm...@redhat.com --- v1 - v2: - add fix for isapc which was searching for 2 buses called ide.0 - explain the semantic change more in the commit message v2 - v3: - add board list to commit message v3 - v4: - Explain impact on migration more clearly in commit message. hw/core/qdev.c | 20 +--- hw/i386/pc_piix.c | 8 +++- include/hw/qdev-core.h | 2 ++ 3 files changed, 22 insertions(+), 8 deletions(-) diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 82a9123..e7985fe 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -430,27 +430,33 @@ DeviceState *qdev_find_recursive(BusState *bus, const char *id) static void qbus_realize(BusState *bus, DeviceState *parent, const char *name) { const char *typename = object_get_typename(OBJECT(bus)); +BusClass *bc; char *buf; -int i,len; +int i, len, bus_id; bus-parent = parent; if (name) { bus-name = g_strdup(name); } else if (bus-parent bus-parent-id) { -/* parent device has
[Qemu-devel] [PATCH V18 07/12] quorum: Add quorum_getlength().
From: Benoît Canet ben...@irqsave.net Check that every bs file returns the same length. Otherwise, return -EIO to disable the quorum and avoid length discrepancy. Signed-off-by: Benoit Canet ben...@irqsave.net Reviewed-by: Max Reitz mre...@redhat.com --- block/quorum.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index 7acaa97..131ffbf 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -595,12 +595,38 @@ static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs, return acb-common; } +static int64_t quorum_getlength(BlockDriverState *bs) +{ +BDRVQuorumState *s = bs-opaque; +int64_t result; +int i; + +/* check that all file have the same length */ +result = bdrv_getlength(s-bs[0]); +if (result 0) { +return result; +} +for (i = 1; i s-num_children; i++) { +int64_t value = bdrv_getlength(s-bs[i]); +if (value 0) { +return value; +} +if (value != result) { +return -EIO; +} +} + +return result; +} + static BlockDriver bdrv_quorum = { .format_name= quorum, .protocol_name = quorum, .instance_size = sizeof(BDRVQuorumState), +.bdrv_getlength = quorum_getlength, + .bdrv_aio_readv = quorum_aio_readv, .bdrv_aio_writev= quorum_aio_writev, }; -- 1.8.3.2
Re: [Qemu-devel] [PATCH 0/3] ARM: three easy patches for coverity-reported issues
On 18 February 2014 12:44, Andreas Färber afaer...@suse.de wrote: What especially annoys me here is that Peter wants to play on Anthony's level on the project but is openly ignoring both our stable releases as a concept (we wouldn't need a release in the first place if we don't care about it working!) and the procedures decided in his presence at QEMU Summit (having maintainer/contributor flag it via Cc: line). If you feel the conclusion we reached there is not working out, feel free to bring this topic up on the KVM call later today - playing Rumpelstilzchen and exempting you from what everyone else is doing is not an acceptable solution. Either we all do it this way or we all decide on another way. It was not my suggestion, just a proposed solution to an issue that affects me, so I'm open to alternatives. I'm just pointing out (as I do pretty much any time somebody says hey you should have cc'd stable) that this workflow isn't working for me as a contributor or as a submaintainer. If you can document and define how it's actually supposed to work and what the policy is for what counts as a bugfix that should target stable (I couldn't find anything on the wiki) then that might help. We should probably discuss this on the call, yes. thanks -- PMM
Re: [Qemu-devel] Fwd: Trying to write a new device / virtio-i2c ?
Il 18/02/2014 13:48, Alex David ha scritto: 2014-02-17 17:11 GMT+01:00 Paolo Bonzini pbonz...@redhat.com mailto:pbonz...@redhat.com: Il 17/02/2014 16:33, Alex David ha scritto: If you need more than one bus, you need a new device exposing the I2C bus, besides the new sensor devices. USB-I2C could be one such device. So let me see if I understood well. USB-I2C (host QEMU device) seems a good idea, I could normally do : qemu-system-i386 -device usb-I2c,chardev=foo -device usb-i2c,chardev=bar -chardev socket,path=/tmp/test0,server,__nowait,id=foo -chardev socket,path=/tmp/test1,server,__nowait,id=bar. Almost. For QOM: -device usb-i2c,id=usb-i2c-0 -device i2c-my-sensor,address=0x48,__bus=usb-i2c-0.0 For chardev: -device usb-i2c,id=usb-i2c-0 -chardev socket,path=/tmp/test0,server,__nowait,id=chr-foo-0 -device i2c-my-sensor,address=0x48,__bus=usb-i2c-0.0,chardev=chr-__foo-0 Repeat for the other buses, replacing -0 with -1 and -2. I need a USB-I2C guest kernel driver that would register a bus (i2c-1 for chardev foo, i2c-2 for chardev bar etc...), I guess ? It exists already, drivers/i2c/busses/i2c-tiny-usb.c. I'm now trying to write this new USB-I2C device, I look through some of the code (dev-serial, dev-...), I used dev-serial code and removed the code that wanted to declare a chardev (I don't need one to create my bus, right ?). Now I'm wondering, how will be i2c-tiny-usb launched and declare a new /dev/i2c-N, with my device. I found : static struct usb_device_id i2c_tiny_usb_table [] = { { USB_DEVICE(0x0403, 0xc631) }, /* FTDI */ { USB_DEVICE(0x1c40, 0x0534) }, /* EZPrototypes */ { } /* Terminating entry */ }; So I figured my USB-I2C should register using these numbers ? static const USBDesc desc_i2c = { .id = { .idVendor = 0x0403, .idProduct = 0xc631, .bcdDevice = 0x0400, .iManufacturer = STR_MANUFACTURER, .iProduct = STR_PRODUCT_I2C, .iSerialNumber = STR_SERIALNUMBER, }, .full = desc_device, .str = desc_strings, }; However, this doesn't work, after a $ qemu-system-i386 debian_wheezy_i386_standard.qcow2 -usb -device USB-I2C,id=usb-i2c-0, the driver doesn't seem to be used (/sys/class/i2c-dev/i2c-0/name isn't for it...), even if I modprobe i2c-tiny-usb... The dmesg shows it doesn't go into the probe function. What did I miss ? I honestly don't know, you'll have to debug it. Paolo
Re: [Qemu-devel] [PATCH 2/8] virtio: allow byte swapping for vring and config access
On Tue, 18 Feb 2014 13:38:54 +0100 Greg Kurz gk...@linux.vnet.ibm.com wrote: From: Rusty Russell ru...@rustcorp.com.au This is based on a simpler patch by Anthony Liguouri, which only handled the vring accesses. We also need some drivers to access these helpers, eg. for data which contains headers. [ ldq_phys() API change, Greg Kurz gk...@linux.vnet.ibm.com ] Signed-off-by: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com --- hw/virtio/virtio.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) static inline void vring_used_flags_unset_bit(VirtQueue *vq, int mask) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, flags); -stw_phys(address_space_memory, - pa, lduw_phys(address_space_memory, pa) ~mask); +virtio_stw_phys(address_space_memory, +pa, lduw_phys(address_space_memory, pa) ~mask); } This needs to be virtio_lduw_phys(), no?
Re: [Qemu-devel] [PULL 00/20] acpi,pc,pci fixes and enhancements
On Tue, 18 Feb 2014 12:45:29 + Stefano Stabellini stefano.stabell...@eu.citrix.com wrote: On Tue, 18 Feb 2014, Paolo Bonzini wrote: Il 18/02/2014 13:16, Stefano Stabellini ha scritto: It looks like that this series breaks disk unplug (hw/ide/piix.c:pci_piix3_xen_ide_unplug). I bisected it and the problem is caused by: commit 5e95494380ecf83c97d28f72134ab45e0cace8f9 Author: Igor Mammedov imamm...@redhat.com Date: Wed Feb 5 16:36:52 2014 +0100 hw/pci: switch to a generic hotplug handling for PCIDevice make qdev_unplug()/device_set_realized() to call hotplug handler's plug/unplug methods if available and remove not needed anymore hot(un)plug handling from PCIDevice. In case if hotplug handler is not available, revert to the legacy hotplug method for compatibility with not yet converted buses. Signed-off-by: Igor Mammedov imamm...@redhat.com Reviewed-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com What exactly breaks? Disk unplug: hw/ide/piix.c:pci_piix3_xen_ide_unplug (see the beginning of the email :-P). It is called by hw/xen/xen_platform.c:platform_fixed_ioport_writew, in response to the guest writing to a magic ioport specifically to unplug the emulated disk. With this patch after the guest boots I can still access both xvda and sda for the same disk, leading to fs corruptions. Could you try with following debug patch? diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 64b66e0..84aa8be 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -214,6 +214,7 @@ void qdev_unplug(DeviceState *dev, Error **errp) return; } +fprintf(stderr, dc-hotpluggable %d\n, dc-hotpluggable); if (!dc-hotpluggable) { error_set(errp, QERR_DEVICE_NO_HOTPLUG, object_get_typename(OBJECT(dev))); @@ -223,8 +224,12 @@ void qdev_unplug(DeviceState *dev, Error **errp) qdev_hot_removed = true; if (dev-parent_bus dev-parent_bus-hotplug_handler) { +fprintf(stderr, bus name: %s, hotplug_handler: %s\n, +dev-parent_bus-name, +object_get_typename(OBJECT(dev-parent_bus-hotplug_handler))); hotplug_handler_unplug(dev-parent_bus-hotplug_handler, dev, errp); } else { +fprintf(stderr, legacy unplug: %p\n, dc-unplug); assert(dc-unplug != NULL); if (dc-unplug(dev) 0) { /* legacy handler */ error_set(errp, QERR_UNDEFINED_ERROR); diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c index 1acd2b2..74b0cac 100644 --- a/hw/i386/pc_piix.c +++ b/hw/i386/pc_piix.c @@ -238,6 +238,7 @@ static void pc_init1(QEMUMachineInitArgs *args, if (pci_enabled acpi_enabled) { i2c_bus *smbus; + fprintf(stderr, create piix4_pm_init\n); smi_irq = qemu_allocate_irqs(pc_acpi_smi_interrupt, first_cpu, 1); /* TODO: Populate SPD eeprom data. */ smbus = piix4_pm_init(pci_bus, piix3_devfn + 3, 0xb100,
Re: [Qemu-devel] [PULL 00/20] acpi,pc,pci fixes and enhancements
Il 18/02/2014 13:45, Stefano Stabellini ha scritto: Disk unplug: hw/ide/piix.c:pci_piix3_xen_ide_unplug (see the beginning of the email :-P). It is called by hw/xen/xen_platform.c:platform_fixed_ioport_writew, in response to the guest writing to a magic ioport specifically to unplug the emulated disk. With this patch after the guest boots I can still access both xvda and sda for the same disk, leading to fs corruptions. Ok, the last paragraph is what I was missing. So this is dc-unplug for the PIIX3 IDE device. Because PCI declares a hotplug handler, dc-unplug is not called anymore. But unlike other dc-unplug callbacks, pci_piix3_xen_ide_unplug doesn't free the device, it just drops the disks underneath. I think the simplest solution is to _not_ make it a dc-unplug callback at all, and call pci_piix3_xen_ide_unplug from unplug_disks instead of qdev_unplug. qdev_unplug means ask guest to start unplug, which is not what Xen wants to do here. Paolo
Re: [Qemu-devel] [PATCH 2/8] virtio: allow byte swapping for vring and config access
On Tue, 18 Feb 2014 14:08:35 +0100 Cornelia Huck cornelia.h...@de.ibm.com wrote: On Tue, 18 Feb 2014 13:38:54 +0100 Greg Kurz gk...@linux.vnet.ibm.com wrote: From: Rusty Russell ru...@rustcorp.com.au This is based on a simpler patch by Anthony Liguouri, which only handled the vring accesses. We also need some drivers to access these helpers, eg. for data which contains headers. [ ldq_phys() API change, Greg Kurz gk...@linux.vnet.ibm.com ] Signed-off-by: Rusty Russell ru...@rustcorp.com.au Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com --- hw/virtio/virtio.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) static inline void vring_used_flags_unset_bit(VirtQueue *vq, int mask) { hwaddr pa; pa = vq-vring.used + offsetof(VRingUsed, flags); -stw_phys(address_space_memory, - pa, lduw_phys(address_space_memory, pa) ~mask); +virtio_stw_phys(address_space_memory, +pa, lduw_phys(address_space_memory, pa) ~mask); } This needs to be virtio_lduw_phys(), no? Oops yes it should be... my mistake. :) -- Gregory Kurz kurzg...@fr.ibm.com gk...@linux.vnet.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)562 165 496 Anarchy is about taking complete responsibility for yourself. Alan Moore.