Re: [PATCH V6 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
On 08/07/2012 05:15 PM, Christoph Hellwig wrote:
> At least after review is done I really think this patch should be folded
> into the previous one.

OK.

> Some more comments below:
>
> @@ -58,6 +58,12 @@ struct virtblk_req
>  struct bio *bio;
>  struct virtio_blk_outhdr out_hdr;
>  struct virtio_scsi_inhdr in_hdr;
> +struct work_struct work;
> +struct virtio_blk *vblk;
>
> I think using bio->bi_private for the virtio_blk pointer would be cleaner.

I wish I could use bio->bi_private, but I am seeing this when using
bio->bi_private to store the virtio_blk pointer:

[1.100335] Call Trace:
[1.100335] <IRQ>
[1.100335] [811dd4b0] ? end_bio_bh_io_sync+0x30/0x50
[1.100335] [811e167d] bio_endio+0x1d/0x40
[1.100335] [81551fb2] virtblk_done+0xa2/0x260
[1.100335] [813d714d] vring_interrupt+0x2d/0x40
[1.100335] [81119c0d] handle_irq_event_percpu+0x6d/0x210
[1.100335] [81119df1] handle_irq_event+0x41/0x70
[1.100335] [8111d2c9] handle_edge_irq+0x69/0x120
[1.100335] [810613a2] handle_irq+0x22/0x30
[1.100335] [81aadccd] do_IRQ+0x5d/0xe0
[1.100335] [81aa432f] common_interrupt+0x6f/0x6f

end_bio_bh_io_sync() uses bio->bi_private:

    struct buffer_head *bh = bio->bi_private;

> +bool is_flush;
> +bool req_flush;
> +bool req_data;
> +bool req_fua;
>
> This could be a bitmap, or better even a single state field.

Will use a bitmap for now.

> +static int virtblk_bio_send_flush(struct virtio_blk *vblk,
> +                                  struct virtblk_req *vbr)
> +static int virtblk_bio_send_data(struct virtio_blk *vblk,
> +                                 struct virtblk_req *vbr)
>
> These should only get the struct virtblk_req * argument, as the
> virtio_blk structure is easily derivable from it.

Yes. Will clean it up.
> +static inline void virtblk_bio_done_flush(struct virtio_blk *vblk,
> +                                          struct virtblk_req *vbr)
> +{
> +    if (vbr->req_data) {
> +        /* Send out the actual write data */
> +        struct virtblk_req *_vbr;
> +        _vbr = virtblk_alloc_req(vblk, GFP_NOIO);
> +        if (!_vbr) {
> +            bio_endio(vbr->bio, -ENOMEM);
> +            goto out;
> +        }
> +        _vbr->req_fua = vbr->req_fua;
> +        _vbr->bio = vbr->bio;
> +        _vbr->vblk = vblk;
> +        INIT_WORK(&_vbr->work, virtblk_bio_send_data_work);
> +        queue_work(virtblk_wq, &_vbr->work);
>
> The _vbr naming isn't too nice. Also, can you explain in a comment here
> why the original request can't be reused?

Ah, the original request can be reused. Will fix this.

> Also, if using a state variable, I think the whole code would be a bit
> cleaner if the bio_done helpers are combined.
>
> -    if (writeback && !use_bio)
> +    if (writeback)
>          blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
>
> Shouldn't this be REQ_FLUSH | REQ_FUA for the bio case?

Without setting REQ_FUA here, I am also seeing bios with the REQ_FUA flag
set. Do we need to set REQ_FUA explicitly?

> Btw, did you verify that flushes really work correctly for all cases
> using tracing in qemu?

I added some debug code in both the kernel and the kvm tool to verify the
flush.

--
Asias

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
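To make the FLUSH/FUA discussion above concrete, here is a minimal standalone sketch — not the patch's code; the enum, `submit_bio()`, and the one-letter log are invented for illustration — of how the four bools could collapse into one state bitmap, and of the ordering a flush-capable backend must preserve: preflush, then the write, then the postflush for FUA.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical single state field replacing the four bools discussed
 * above (is_flush, req_flush, req_data, req_fua). */
enum vbr_state {
    VBR_IS_FLUSH  = 1 << 0, /* request is a bare flush             */
    VBR_REQ_FLUSH = 1 << 1, /* bio carries REQ_FLUSH: pre-flush    */
    VBR_REQ_DATA  = 1 << 2, /* bio carries data to write           */
    VBR_REQ_FUA   = 1 << 3, /* bio carries REQ_FUA: post-flush     */
};

/* Append one letter per virtqueue submission to 'log' so the required
 * ordering (pre-flush -> data -> post-flush) can be checked. */
static void submit_bio(unsigned state, char *log, size_t len)
{
    if (state & VBR_REQ_FLUSH)
        strncat(log, "F", len - strlen(log) - 1); /* pre-flush  */
    if (state & VBR_REQ_DATA)
        strncat(log, "W", len - strlen(log) - 1); /* write data */
    if (state & VBR_REQ_FUA)
        strncat(log, "f", len - strlen(log) - 1); /* post-flush */
}
```

The point of the single field is that the completion handler can branch on one value instead of four independent booleans.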
Re: [PATCH V6 0/2] Improve virtio-blk performance
On 08/08/2012 11:09 AM, Asias He wrote:
> On 08/07/2012 05:16 PM, Christoph Hellwig wrote:
>> On Tue, Aug 07, 2012 at 04:47:13PM +0800, Asias He wrote:
>>> 1) Ramdisk device
>>>    With bio-based IO path, sequential read/write, random read/write
>>>    IOPS boost         : 28%, 24%, 21%, 16%
>>>    Latency improvement: 32%, 17%, 21%, 16%
>>> 2) Fusion IO device
>>>    With bio-based IO path, sequential read/write, random read/write
>>>    IOPS boost         : 11%, 11%, 13%, 10%
>>>    Latency improvement: 10%, 10%, 12%, 10%
>>
>> Do you also have numbers for normal SAS/SATA disks? The changelog
>> should have a really good explanation of why this is optional, and
>> numbers are a very important part of that.

Yes. I posted the numbers on normal SATA disks a few days ago in the
thread with Sasha. Will add that data and the explanation of why this is
optional to the changelog.

--
Asias
[PATCH 02/15] qom: using atomic ops to re-implement object_ref
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/object.h |    3 ++-
 qom/object.c          |   13 +
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/qemu/object.h b/include/qemu/object.h
index 8b17776..58db9d0 100644
--- a/include/qemu/object.h
+++ b/include/qemu/object.h
@@ -18,6 +18,7 @@
 #include <stdint.h>
 #include <stdbool.h>
 #include "qemu-queue.h"
+#include "qemu/atomic.h"

 struct Visitor;
 struct Error;
@@ -262,7 +263,7 @@ struct Object
     ObjectClass *class;
     GSList *interfaces;
     QTAILQ_HEAD(, ObjectProperty) properties;
-    uint32_t ref;
+    Atomic ref;
     Object *parent;
 };

diff --git a/qom/object.c b/qom/object.c
index 00bb3b0..822bdb7 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -378,7 +378,7 @@ void object_finalize(void *data)
     object_deinit(obj, ti);

     object_property_del_all(obj);
-    g_assert(obj->ref == 0);
+    g_assert(atomic_read(&obj->ref) == 0);
 }

 Object *object_new_with_type(Type type)
@@ -405,7 +405,7 @@ Object *object_new(const char *typename)
 void object_delete(Object *obj)
 {
     object_unref(obj);
-    g_assert(obj->ref == 0);
+    g_assert(atomic_read(&obj->ref) == 0);
     g_free(obj);
 }

@@ -639,16 +639,13 @@ GSList *object_class_get_list(const char *implements_type,

 void object_ref(Object *obj)
 {
-    obj->ref++;
+    atomic_inc(&obj->ref);
 }

 void object_unref(Object *obj)
 {
-    g_assert(obj->ref > 0);
-    obj->ref--;
-
-    /* parent always holds a reference to its children */
-    if (obj->ref == 0) {
+    g_assert(atomic_read(&obj->ref) > 0);
+    if (atomic_dec_and_test(&obj->ref)) {
         object_finalize(obj);
     }
 }
--
1.7.4.4
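The object_ref/object_unref change above is the classic atomic refcount pattern: the decrement and the zero test must be one atomic step so that only one thread observes the 1 -> 0 transition and finalizes. A standalone sketch using GCC/Clang `__atomic` builtins — not QEMU's code; `Obj` and the `finalized` flag are invented stand-ins for Object and object_finalize():

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int ref;
    bool finalized;
} Obj;

static void obj_ref(Obj *o)
{
    __atomic_fetch_add(&o->ref, 1, __ATOMIC_SEQ_CST);
}

static void obj_unref(Obj *o)
{
    assert(__atomic_load_n(&o->ref, __ATOMIC_SEQ_CST) > 0);
    /* atomic_dec_and_test(): true only for the 1 -> 0 transition,
     * so finalization runs exactly once even with racing callers. */
    if (__atomic_sub_fetch(&o->ref, 1, __ATOMIC_SEQ_CST) == 0) {
        o->finalized = true; /* stands in for object_finalize() */
    }
}
```

Compare with the non-atomic original (`obj->ref--; if (obj->ref == 0)`), where two racing unrefs could both see zero or neither could.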
[PATCH 03/15] qom: introduce reclaimer to release obj
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Collect unused objects and release them at the caller's demand.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/reclaimer.h |   28 ++
 main-loop.c              |    5
 qemu-tool.c              |    5
 qom/Makefile.objs        |    2 +-
 qom/reclaimer.c          |   58 ++
 5 files changed, 97 insertions(+), 1 deletions(-)
 create mode 100644 include/qemu/reclaimer.h
 create mode 100644 qom/reclaimer.c

diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
new file mode 100644
index 000..9307e93
--- /dev/null
+++ b/include/qemu/reclaimer.h
@@ -0,0 +1,28 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_RECLAIMER
+#define QEMU_RECLAIMER
+
+typedef void ReleaseHandler(void *opaque);
+typedef struct Chunk {
+    QLIST_ENTRY(Chunk) list;
+    void *opaque;
+    ReleaseHandler *release;
+} Chunk;
+
+typedef struct ChunkHead {
+    struct Chunk *lh_first;
+} ChunkHead;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release);
+void reclaimer_worker(ChunkHead *head);
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release);
+void qemu_reclaimer(void);
+#endif
diff --git a/main-loop.c b/main-loop.c
index eb3b6e6..be9d095 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -26,6 +26,7 @@
 #include "qemu-timer.h"
 #include "slirp/slirp.h"
 #include "main-loop.h"
+#include "qemu/reclaimer.h"

 #ifndef _WIN32
@@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
        them.  */
     qemu_bh_poll();

+    /* ref to device from iohandler/bh/timer do not obey the rules, so delay
+     * reclaiming until now.
+     */
+    qemu_reclaimer();
     return ret;
 }
diff --git a/qemu-tool.c b/qemu-tool.c
index 318c5fc..f5fe319 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -21,6 +21,7 @@
 #include "main-loop.h"
 #include "qemu_socket.h"
 #include "slirp/libslirp.h"
+#include "qemu/reclaimer.h"

 #include <sys/time.h>
@@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
 {
 }

+void qemu_reclaimer(void)
+{
+}
+
 int use_icount;

 void qemu_clock_warp(QEMUClock *clock)
diff --git a/qom/Makefile.objs b/qom/Makefile.objs
index 5ef060a..a579261 100644
--- a/qom/Makefile.objs
+++ b/qom/Makefile.objs
@@ -1,4 +1,4 @@
-qom-obj-y = object.o container.o qom-qobject.o
+qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
 qom-obj-twice-y = cpu.o
 common-obj-y = $(qom-obj-twice-y)
 user-obj-y = $(qom-obj-twice-y)
diff --git a/qom/reclaimer.c b/qom/reclaimer.c
new file mode 100644
index 000..6cb53e3
--- /dev/null
+++ b/qom/reclaimer.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "qemu-thread.h"
+#include "main-loop.h"
+#include "qemu-queue.h"
+#include "qemu/reclaimer.h"
+
+static struct QemuMutex reclaimer_lock;
+static QLIST_HEAD(rcl, Chunk) reclaimer_list;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
+{
+    Chunk *r = g_malloc0(sizeof(Chunk));
+    r->opaque = opaque;
+    r->release = release;
+    QLIST_INSERT_HEAD_RCU(head, r, list);
+}
+
+void reclaimer_worker(ChunkHead *head)
+{
+    Chunk *cur, *next;
+
+    QLIST_FOREACH_SAFE(cur, head, list, next) {
+        QLIST_REMOVE(cur, list);
+        cur->release(cur->opaque);
+        g_free(cur);
+    }
+}
+
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
+{
+    Chunk *r = g_malloc0(sizeof(Chunk));
+    r->opaque = opaque;
+    r->release = release;
+    qemu_mutex_lock(&reclaimer_lock);
+    QLIST_INSERT_HEAD_RCU(&reclaimer_list, r, list);
+    qemu_mutex_unlock(&reclaimer_lock);
+}
+
+void qemu_reclaimer(void)
+{
+    Chunk *cur, *next;
+
+    QLIST_FOREACH_SAFE(cur, &reclaimer_list, list, next) {
+        QLIST_REMOVE(cur, list);
+        cur->release(cur->opaque);
+        g_free(cur);
+    }
+}
--
1.7.4.4
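The reclaimer's contract — enqueue an opaque pointer plus its release handler, then drain the queue at a safe point in the main loop — can be modeled in a few lines. A hypothetical standalone version (singly-linked list instead of QLIST, no locking, invented names):

```c
#include <assert.h>
#include <stdlib.h>

typedef void ReleaseFn(void *opaque);

typedef struct Node {
    struct Node *next;
    void *opaque;
    ReleaseFn *release;
} Node;

static Node *pending;

/* Defer a release: remember the pointer and how to free it. */
static void reclaim_enqueue(void *opaque, ReleaseFn *release)
{
    Node *n = malloc(sizeof(*n));
    n->opaque = opaque;
    n->release = release;
    n->next = pending;
    pending = n;
}

/* Drain at a safe point: run every handler, free the bookkeeping. */
static void reclaim_drain(void)
{
    while (pending) {
        Node *n = pending;
        pending = n->next;
        n->release(n->opaque);
        free(n);
    }
}

/* demo release handler: counts invocations */
static int released;
static void count_release(void *opaque) { (void)opaque; released++; }
```

The design point carried over from the patch: objects referenced by iohandlers, bottom halves, or timers cannot be freed at unref time, so the free is deferred to a point where no such reference can be live.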
[PATCH 04/15] memory: MemoryRegion topology must be stable when updating
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Use mem_map_lock to protect updaters from one another, so we can get an
intact snapshot of the memory topology -- the FlatView & radix-tree.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |    3 +++
 memory.c |   22 ++
 memory.h |    2 ++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 8244d54..0e29ef9 100644
--- a/exec.c
+++ b/exec.c
@@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
    The bottom level has pointers to MemoryRegionSections. */
 static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };

+QemuMutex mem_map_lock;
+
 static void io_mem_init(void);
 static void memory_map_init(void);

@@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
 #if !defined(CONFIG_USER_ONLY)
     memory_map_init();
     io_mem_init();
+    qemu_mutex_init(&mem_map_lock);
 #endif
 }

diff --git a/memory.c b/memory.c
index aab4a31..5986532 100644
--- a/memory.c
+++ b/memory.c
@@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
     assert(memory_region_transaction_depth);
     --memory_region_transaction_depth;
     if (!memory_region_transaction_depth && memory_region_update_pending) {
+        qemu_mutex_lock(&mem_map_lock);
         memory_region_update_topology(NULL);
+        qemu_mutex_unlock(&mem_map_lock);
     }
 }

@@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;

+    qemu_mutex_lock(&mem_map_lock);
     mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask);
     memory_region_update_topology(mr);
+    qemu_mutex_unlock(&mem_map_lock);
 }

 bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
@@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
 void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
 {
     if (mr->readonly != readonly) {
+        qemu_mutex_lock(&mem_map_lock);
         mr->readonly = readonly;
         memory_region_update_topology(mr);
+        qemu_mutex_unlock(&mem_map_lock);
     }
 }

@@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable)
 {
     if (mr->readable != readable) {
         mr->readable = readable;
+        qemu_mutex_lock(&mem_map_lock);
         memory_region_update_topology(mr);
+        qemu_mutex_unlock(&mem_map_lock);
     }
 }

@@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
     };
     unsigned i;

+    qemu_mutex_lock(&mem_map_lock);
     for (i = 0; i < mr->ioeventfd_nb; ++i) {
         if (memory_region_ioeventfd_before(mrfd, mr->ioeventfds[i])) {
             break;
@@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
             sizeof(*mr->ioeventfds) * (mr->ioeventfd_nb-1 - i));
     mr->ioeventfds[i] = mrfd;
     memory_region_update_topology(mr);
+    qemu_mutex_unlock(&mem_map_lock);
 }

 void memory_region_del_eventfd(MemoryRegion *mr,
@@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
     };
     unsigned i;

+    qemu_mutex_lock(&mem_map_lock);
     for (i = 0; i < mr->ioeventfd_nb; ++i) {
         if (memory_region_ioeventfd_equal(mrfd, mr->ioeventfds[i])) {
             break;
@@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
     mr->ioeventfds = g_realloc(mr->ioeventfds,
                                sizeof(*mr->ioeventfds)*mr->ioeventfd_nb + 1);
     memory_region_update_topology(mr);
+    qemu_mutex_unlock(&mem_map_lock);
 }

 static void memory_region_add_subregion_common(MemoryRegion *mr,
@@ -1259,6 +1271,8 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
     assert(!subregion->parent);
     subregion->parent = mr;
     subregion->addr = offset;
+
+    qemu_mutex_lock(&mem_map_lock);
     QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
         if (subregion->may_overlap || other->may_overlap) {
             continue;
@@ -1289,6 +1303,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
     QTAILQ_INSERT_TAIL(&mr->subregions, subregion, subregions_link);
 done:
     memory_region_update_topology(mr);
+    qemu_mutex_unlock(&mem_map_lock);
 }

@@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
 {
     assert(subregion->parent == mr);
     subregion->parent = NULL;
+
+    qemu_mutex_lock(&mem_map_lock);
     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
     memory_region_update_topology(mr);
+    qemu_mutex_unlock(&mem_map_lock);
 }

 void memory_region_set_enabled(MemoryRegion *mr, bool enabled)
@@ -1325,8 +1343,10 @@ void memory_region_set_enabled(MemoryRegion *mr, bool enabled)
     if (enabled == mr->enabled) {
         return;
     }
+    qemu_mutex_lock(&mem_map_lock);
     mr->enabled = enabled;
     memory_region_update_topology(NULL);
+    qemu_mutex_unlock(&mem_map_lock);
 }

 void memory_region_set_address(MemoryRegion *mr, target_phys_addr_t
[PATCH 0/15 v2] prepare unplug out of protection of global lock
Background: refer to the original plan posted by Marcelo Tosatti,
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html

Previous version:
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03312.html

Changes v1 -> v2:
  -- introduce atomic ops
  -- introduce ref cnt for MemoryRegion
  -- make memory's flat view and radix tree toward rcu style
[PATCH 05/15] memory: introduce life_ops to MemoryRegion
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

The type of object referred to by a MemoryRegion varies, e.g. another mr,
a DeviceState, or another struct defined by a driver. So the ref/unref
may differ per driver. Using these ops, we can manage the backend object.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/ide/piix.c |    6 ++--
 hw/pckbd.c    |    6 +++-
 hw/serial.c   |    2 +-
 ioport.c      |    3 +-
 memory.c      |   69 -
 memory.h      |   16 +
 6 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index f5a74c2..bdd70b1 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -93,11 +93,11 @@ static void bmdma_setup_bar(PCIIDEState *d)
     for(i = 0;i < 2; i++) {
         BMDMAState *bm = &d->bmdma[i];
-        memory_region_init_io(&bm->extra_io, &piix_bmdma_ops, bm,
+        memory_region_init_io_ext(&bm->extra_io, &piix_bmdma_ops, NULL, bm,
                               "piix-bmdma", 4);
         memory_region_add_subregion(&d->bmdma_bar, i * 8, &bm->extra_io);
-        memory_region_init_io(&bm->addr_ioport, &bmdma_addr_ioport_ops, bm,
-                              "bmdma", 4);
+        memory_region_init_io_ext(&bm->addr_ioport, &bmdma_addr_ioport_ops,
+                                  NULL, bm, "bmdma", 4);
         memory_region_add_subregion(&d->bmdma_bar, i * 8 + 4, &bm->addr_ioport);
     }
 }
diff --git a/hw/pckbd.c b/hw/pckbd.c
index 69857ba..de3c46d 100644
--- a/hw/pckbd.c
+++ b/hw/pckbd.c
@@ -485,10 +485,12 @@ static int i8042_initfn(ISADevice *dev)
     isa_init_irq(dev, &s->irq_kbd, 1);
     isa_init_irq(dev, &s->irq_mouse, 12);

-    memory_region_init_io(isa_s->io + 0, &i8042_data_ops, s, "i8042-data", 1);
+    memory_region_init_io_ext(isa_s->io + 0, &i8042_data_ops, NULL, s,
+                              "i8042-data", 1);
     isa_register_ioport(dev, isa_s->io + 0, 0x60);

-    memory_region_init_io(isa_s->io + 1, &i8042_cmd_ops, s, "i8042-cmd", 1);
+    memory_region_init_io_ext(isa_s->io + 1, &i8042_cmd_ops, NULL, s,
+                              "i8042-cmd", 1);
     isa_register_ioport(dev, isa_s->io + 1, 0x64);

     s->kbd = ps2_kbd_init(kbd_update_kbd_irq, s);
diff --git a/hw/serial.c b/hw/serial.c
index a421d1e..e992c6a 100644
--- a/hw/serial.c
+++ b/hw/serial.c
@@ -794,7 +794,7 @@ static int serial_isa_initfn(ISADevice *dev)
     serial_init_core(s);

     qdev_set_legacy_instance_id(&dev->qdev, isa->iobase, 3);
-    memory_region_init_io(&s->io, &serial_io_ops, s, "serial", 8);
+    memory_region_init_io_ext(&s->io, &serial_io_ops, NULL, s, "serial", 8);
     isa_register_ioport(dev, &s->io, isa->iobase);
     return 0;
 }
diff --git a/ioport.c b/ioport.c
index 6e4ca0d..768e271 100644
--- a/ioport.c
+++ b/ioport.c
@@ -384,7 +384,8 @@ static void portio_list_add_1(PortioList *piolist,
      * Use an alias so that the callback is called with an absolute address,
      * rather than an offset relative to to start + off_low.
      */
-    memory_region_init_io(region, ops, piolist->opaque, piolist->name,
+    memory_region_init_io_ext(region, ops, NULL, piolist->opaque,
+                              piolist->name,
                           INT64_MAX);
     memory_region_init_alias(alias, piolist->name, region,
                              start + off_low, off_high - off_low);
diff --git a/memory.c b/memory.c
index 5986532..80c7529 100644
--- a/memory.c
+++ b/memory.c
@@ -19,6 +19,7 @@
 #include "bitops.h"
 #include "kvm.h"
 #include <assert.h>
+#include "hw/qdev.h"

 #define WANT_EXEC_OBSOLETE
 #include "exec-obsolete.h"
@@ -799,6 +800,7 @@ static bool memory_region_wrong_endianness(MemoryRegion *mr)
 #endif
 }

+static MemoryRegionLifeOps nops;
 void memory_region_init(MemoryRegion *mr,
                         const char *name,
                         uint64_t size)
@@ -809,6 +811,7 @@ void memory_region_init(MemoryRegion *mr,
     if (size == UINT64_MAX) {
         mr->size = int128_2_64();
     }
+    mr->life_ops = &nops;
     mr->addr = 0;
     mr->subpage = false;
     mr->enabled = true;
@@ -931,6 +934,66 @@ static void memory_region_dispatch_write(MemoryRegion *mr,
                                  memory_region_write_accessor, mr);
 }

+static void mr_object_get(MemoryRegion *mr)
+{
+    object_dynamic_cast_assert(OBJECT(mr->opaque), TYPE_DEVICE);
+    object_ref(OBJECT(mr->opaque));
+}
+
+static void mr_object_put(MemoryRegion *mr)
+{
+    object_unref(OBJECT(mr->opaque));
+}
+
+static MemoryRegionLifeOps obj_ops = {
+    .get = mr_object_get,
+    .put = mr_object_put,
+};
+
+static void mr_alias_get(MemoryRegion *mr)
+{
+}
+
+static void mr_alias_put(MemoryRegion *mr)
+{
+}
+
+static MemoryRegionLifeOps alias_ops = {
+    .get = mr_alias_get,
+    .put = mr_alias_put,
+};
+
+static void mr_nop_get(MemoryRegion *mr)
+{
+}
+
+static void mr_nop_put(MemoryRegion *mr)
+{
+}
+
+static MemoryRegionLifeOps nops = {
+    .get = mr_nop_get,
+    .put = mr_nop_put,
+};
+
+void memory_region_init_io_ext(MemoryRegion
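The life_ops idea above is an ordinary function-pointer vtable: each kind of region gets a get/put pair, so reference management can differ per backing object (a device, an alias, or nothing at all). A toy standalone model — names only loosely mirror the patch; `Region` and `backing_refs` are invented for illustration:

```c
#include <assert.h>

typedef struct Region Region;

typedef struct {
    void (*get)(Region *);
    void (*put)(Region *);
} LifeOps;

struct Region {
    const LifeOps *life_ops;
    int backing_refs; /* stands in for the referred object's refcount */
};

/* device-backed regions pin/unpin the backing object */
static void obj_get(Region *r) { r->backing_refs++; }
static void obj_put(Region *r) { r->backing_refs--; }

/* aliases and plain RAM regions need no backing management */
static void nop_get(Region *r) { (void)r; }
static void nop_put(Region *r) { (void)r; }

static const LifeOps obj_ops = { obj_get, obj_put };
static const LifeOps nops    = { nop_get, nop_put };
```

The caller never branches on the region kind; it just invokes `r->life_ops->get(r)`, which is exactly what lets later patches hook refcounting in without touching every call site.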
[PATCH 06/15] memory: use refcnt to manage MemoryRegion
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Use a refcnt for mr, so we can separate mr's life cycle management from
the referred object. When mr->ref goes 0 -> 1, ref the referred object;
when mr->ref goes 1 -> 0, unref it. The referred object can be a
DeviceState, another mr, or some other opaque.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |   18 ++
 memory.h |    5 +
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index 80c7529..5dc8b59 100644
--- a/memory.c
+++ b/memory.c
@@ -811,6 +811,7 @@ void memory_region_init(MemoryRegion *mr,
     if (size == UINT64_MAX) {
         mr->size = int128_2_64();
     }
+    atomic_set(&mr->ref, 0);
     mr->life_ops = &nops;
     mr->addr = 0;
     mr->subpage = false;
@@ -1090,6 +1091,23 @@ static const MemoryRegionOps reservation_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };

+void memory_region_get(MemoryRegion *mr)
+{
+    if (atomic_add_and_return(1, &mr->ref) == 1) {
+        mr->life_ops->get(mr);
+    }
+}
+
+void memory_region_put(MemoryRegion *mr)
+{
+    assert(atomic_read(&mr->ref) > 0);
+
+    if (atomic_dec_and_test(&mr->ref)) {
+        /* to fix, using call_rcu( ,release) */
+        mr->life_ops->put(mr);
+    }
+}
+
 void memory_region_init_reservation(MemoryRegion *mr,
                                     const char *name,
                                     uint64_t size)
diff --git a/memory.h b/memory.h
index 8fb543b..740f018 100644
--- a/memory.h
+++ b/memory.h
@@ -18,6 +18,7 @@

 #include <stdint.h>
 #include <stdbool.h>
+#include "qemu/atomic.h"
 #include "qemu-common.h"
 #include "cpu-common.h"
 #include "targphys.h"
@@ -26,6 +27,7 @@
 #include "ioport.h"
 #include "int128.h"
 #include "qemu-thread.h"
+#include "qemu/reclaimer.h"

 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
@@ -126,6 +128,7 @@ typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
 struct MemoryRegion {
     /* All fields are private - violators will be prosecuted */
     const MemoryRegionOps *ops;
+    Atomic ref;
     MemoryRegionLifeOps *life_ops;
     void *opaque;
     MemoryRegion *parent;
@@ -766,6 +769,8 @@ void memory_global_dirty_log_stop(void);

 void mtree_info(fprintf_function mon_printf, void *f);

+void memory_region_get(MemoryRegion *mr);
+void memory_region_put(MemoryRegion *mr);
 #endif

 #endif
--
1.7.4.4
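Note that memory_region_get/put above are edge-triggered: only the 0 -> 1 transition pins the backing object via life_ops->get, and only the 1 -> 0 transition unpins it; intermediate references touch just the region's own counter. A standalone model of that behavior — hypothetical names, compiler builtins in place of the patch's Atomic helpers:

```c
#include <assert.h>

typedef struct {
    int ref;          /* region's own count (Atomic in the patch)  */
    int backing_pins; /* life_ops->get/put calls observed          */
} Region;

static void region_get(Region *r)
{
    /* atomic_add_and_return(1, &mr->ref) == 1: first reference */
    if (__atomic_add_fetch(&r->ref, 1, __ATOMIC_SEQ_CST) == 1) {
        r->backing_pins++; /* life_ops->get(mr) */
    }
}

static void region_put(Region *r)
{
    assert(__atomic_load_n(&r->ref, __ATOMIC_SEQ_CST) > 0);
    /* atomic_dec_and_test(&mr->ref): last reference dropped */
    if (__atomic_sub_fetch(&r->ref, 1, __ATOMIC_SEQ_CST) == 0) {
        r->backing_pins--; /* life_ops->put(mr) */
    }
}
```

This is what lets a device be pinned exactly once no matter how many views of the topology reference one of its regions.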
[PATCH 01/15] atomic: introduce atomic operations
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Once out of the global lock, we will be challenged by SMP at a low
level, so we need atomic ops.

This file is heavily copied from the kernel. Currently, only x86 atomic
ops are included; it will be extended for other archs in the future.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/atomic.h |  161 +
 1 files changed, 161 insertions(+), 0 deletions(-)
 create mode 100644 include/qemu/atomic.h

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
new file mode 100644
index 000..8e1fc3e
--- /dev/null
+++ b/include/qemu/atomic.h
@@ -0,0 +1,161 @@
+/*
+ * Simple interface for atomic operations.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
+
+typedef struct Atomic {
+    int counter;
+} Atomic;
+
+
+#if defined(__i386__) || defined(__x86_64__)
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically reads the value of @v.
+ */
+static inline int atomic_read(const Atomic *v)
+{
+    return (*(volatile int *)&(v)->counter);
+}
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer of type Atomic
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static inline void atomic_set(Atomic *v, int i)
+{
+    v->counter = i;
+}
+
+/**
+ * atomic_add - add integer to atomic variable
+ * @i: integer value to add
+ * @v: pointer of type Atomic
+ *
+ * Atomically adds @i to @v.
+ */
+static inline void atomic_add(int i, Atomic *v)
+{
+    asm volatile("lock; addl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}
+
+/**
+ * atomic_sub - subtract integer from atomic variable
+ * @i: integer value to subtract
+ * @v: pointer of type Atomic
+ *
+ * Atomically subtracts @i from @v.
+ */
+static inline void atomic_sub(int i, Atomic *v)
+{
+    asm volatile("lock; subl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}
+
+/**
+ * atomic_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type Atomic
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic_sub_and_test(int i, Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; subl %2,%0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : "ir" (i) : "memory");
+    return c;
+}
+
+/**
+ * atomic_inc - increment atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1.
+ */
+static inline void atomic_inc(Atomic *v)
+{
+    asm volatile("lock; incl %0"
+                 : "+m" (v->counter));
+}
+
+/**
+ * atomic_dec - decrement atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1.
+ */
+static inline void atomic_dec(Atomic *v)
+{
+    asm volatile("lock; decl %0"
+                 : "+m" (v->counter));
+}
+
+/**
+ * atomic_dec_and_test - decrement and test
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static inline int atomic_dec_and_test(Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; decl %0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : : "memory");
+    return c != 0;
+}
+
+/**
+ * atomic_inc_and_test - increment and test
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic_inc_and_test(Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; incl %0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : : "memory");
+    return c != 0;
+}
+
+static inline int atomic_add_and_return(int i, Atomic *v)
+{
+    int ret = i;
+
+    asm volatile ("lock; xaddl %0, %1"
+                  : "+r" (ret), "+m" (v->counter)
+                  : : "memory", "cc");
+
+    return ret + i;
+}
+#endif
+
+#endif
--
1.7.4.4
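The same interface can be expressed without inline asm: on GCC and Clang each routine maps onto a `__atomic` builtin (lock-prefixed add/sub/xadd become add_fetch/sub_fetch). A portable sketch of the subset this series actually uses — not part of the patch, just an equivalence check:

```c
#include <assert.h>

typedef struct Atomic { int counter; } Atomic;

static inline int atomic_read(const Atomic *v)
{
    return __atomic_load_n(&v->counter, __ATOMIC_SEQ_CST);
}

static inline void atomic_set(Atomic *v, int i)
{
    __atomic_store_n(&v->counter, i, __ATOMIC_SEQ_CST);
}

static inline void atomic_inc(Atomic *v)
{
    __atomic_fetch_add(&v->counter, 1, __ATOMIC_SEQ_CST);
}

/* true only when the decrement produced zero (the 1 -> 0 transition) */
static inline int atomic_dec_and_test(Atomic *v)
{
    return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_SEQ_CST) == 0;
}

/* lock; xaddl equivalent: returns the post-add value */
static inline int atomic_add_and_return(int i, Atomic *v)
{
    return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
}
```

A builtin-based version would also answer the "x86 only" caveat in the changelog, since the compiler emits the right sequence per target.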
[PATCH 07/15] memory: inc/dec mr's ref when adding/removing from mem view
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

memory_region_{add,del}_subregion will inc/dec mr's refcnt.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/memory.c b/memory.c
index 5dc8b59..2eaa2fc 100644
--- a/memory.c
+++ b/memory.c
@@ -1356,7 +1356,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
     assert(!subregion->parent);
     subregion->parent = mr;
     subregion->addr = offset;
-
+    memory_region_get(subregion);
     qemu_mutex_lock(&mem_map_lock);
     QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
         if (subregion->may_overlap || other->may_overlap) {
@@ -1420,6 +1420,8 @@ void memory_region_del_subregion(MemoryRegion *mr,
     qemu_mutex_lock(&mem_map_lock);
     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
     memory_region_update_topology(mr);
+    /* mr may be still in use by reader of radix, must delay to release */
+    memory_region_put(subregion);
     qemu_mutex_unlock(&mem_map_lock);
 }
--
1.7.4.4
[PATCH 08/15] memory: introduce PhysMap to present snapshot of topology
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

PhysMap contains the flatview and the radix-tree view; they are
snapshots of the system topology and should be consistent with each
other. With PhysMap, we can swap a single pointer when updating and
achieve atomicity.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |    8
 memory.c |   33 -
 memory.h |   62 --
 3 files changed, 60 insertions(+), 43 deletions(-)

diff --git a/exec.c b/exec.c
index 0e29ef9..01b91b0 100644
--- a/exec.c
+++ b/exec.c
@@ -156,8 +156,6 @@ typedef struct PageDesc {
 #endif

 /* Size of the L2 (and L3, etc) page tables.  */
-#define L2_BITS 10
-#define L2_SIZE (1 << L2_BITS)

 #define P_L2_LEVELS \
     (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
@@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
 static void *l1_map[V_L1_SIZE];

 #if !defined(CONFIG_USER_ONLY)
-typedef struct PhysPageEntry PhysPageEntry;

 static MemoryRegionSection *phys_sections;
 static unsigned phys_sections_nb, phys_sections_nb_alloc;
@@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty;
 static uint16_t phys_section_rom;
 static uint16_t phys_section_watch;

-struct PhysPageEntry {
-    uint16_t is_leaf : 1;
-    /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
-    uint16_t ptr : 15;
-};

 /* Simple allocator for PhysPageEntry nodes */
 static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
diff --git a/memory.c b/memory.c
index 2eaa2fc..c7f2cfd 100644
--- a/memory.c
+++ b/memory.c
@@ -31,17 +31,6 @@ static bool global_dirty_log = false;
 static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners
     = QTAILQ_HEAD_INITIALIZER(memory_listeners);

-typedef struct AddrRange AddrRange;
-
-/*
- * Note using signed integers limits us to physical addresses at most
- * 63 bits wide.  They are needed for negative offsetting in aliases
- * (large MemoryRegion::alias_offset).
- */
-struct AddrRange {
-    Int128 start;
-    Int128 size;
-};

 static AddrRange addrrange_make(Int128 start, Int128 size)
 {
@@ -197,28 +186,6 @@ static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd a,
         && !memory_region_ioeventfd_before(b, a);
 }

-typedef struct FlatRange FlatRange;
-typedef struct FlatView FlatView;
-
-/* Range of memory in the global map.  Addresses are absolute. */
-struct FlatRange {
-    MemoryRegion *mr;
-    target_phys_addr_t offset_in_region;
-    AddrRange addr;
-    uint8_t dirty_log_mask;
-    bool readable;
-    bool readonly;
-};
-
-/* Flattened global view of current active memory hierarchy.  Kept in sorted
- * order.
- */
-struct FlatView {
-    FlatRange *ranges;
-    unsigned nr;
-    unsigned nr_allocated;
-};

 typedef struct AddressSpace AddressSpace;
 typedef struct AddressSpaceOps AddressSpaceOps;
diff --git a/memory.h b/memory.h
index 740f018..357edd8 100644
--- a/memory.h
+++ b/memory.h
@@ -29,12 +29,72 @@
 #include "qemu-thread.h"
 #include "qemu/reclaimer.h"

+typedef struct AddrRange AddrRange;
+typedef struct FlatRange FlatRange;
+typedef struct FlatView FlatView;
+typedef struct PhysPageEntry PhysPageEntry;
+typedef struct PhysMap PhysMap;
+typedef struct MemoryRegionSection MemoryRegionSection;
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
 typedef struct MemoryRegion MemoryRegion;
 typedef struct MemoryRegionPortio MemoryRegionPortio;
 typedef struct MemoryRegionMmio MemoryRegionMmio;

+/*
+ * Note using signed integers limits us to physical addresses at most
+ * 63 bits wide.  They are needed for negative offsetting in aliases
+ * (large MemoryRegion::alias_offset).
+ */
+struct AddrRange {
+    Int128 start;
+    Int128 size;
+};
+
+/* Range of memory in the global map.  Addresses are absolute. */
+struct FlatRange {
+    MemoryRegion *mr;
+    target_phys_addr_t offset_in_region;
+    AddrRange addr;
+    uint8_t dirty_log_mask;
+    bool readable;
+    bool readonly;
+};
+
+/* Flattened global view of current active memory hierarchy.  Kept in sorted
+ * order.
+ */
+struct FlatView {
+    FlatRange *ranges;
+    unsigned nr;
+    unsigned nr_allocated;
+};
+
+struct PhysPageEntry {
+    uint16_t is_leaf:1;
+    /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
+    uint16_t ptr:15;
+};
+
+#define L2_BITS 10
+#define L2_SIZE (1 << L2_BITS)
+
+/* This is a multi-level map on the physical address space.
+   The bottom level has pointers to MemoryRegionSections. */
+struct PhysMap {
+    Atomic ref;
+    PhysPageEntry root;
+    PhysPageEntry (*phys_map_nodes)[L2_SIZE];
+    unsigned phys_map_nodes_nb;
+    unsigned phys_map_nodes_nb_alloc;
+
+    MemoryRegionSection *phys_sections;
+    unsigned phys_sections_nb;
+    unsigned phys_sections_nb_alloc;
+
+    /* FlatView */
+    FlatView views[2];
+};
+
 /* Must match *_DIRTY_FLAGS in cpu-all.h.  To be replaced with dynamic
 *
[PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
From: Liu Ping Fan pingf...@linux.vnet.ibm.com The flatview and the radix tree are both reached through a single pointer, which makes switching to a new map effectively atomic. The MemoryRegions accessed through radix-tree leaves or the flatview are reclaimed only after the previous PhysMap is no longer in use. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c | 303 +++--- hw/vhost.c |2 +- hw/xen_pt.c |2 +- kvm-all.c |2 +- memory.c| 92 ++- memory.h|9 ++- vl.c|1 + xen-all.c |2 +- 8 files changed, 286 insertions(+), 127 deletions(-) diff --git a/exec.c b/exec.c index 01b91b0..97addb9 100644 --- a/exec.c +++ b/exec.c @@ -24,6 +24,7 @@ #include <sys/mman.h> #endif +#include "qemu/atomic.h" #include "qemu-common.h" #include "cpu.h" #include "tcg.h" @@ -35,6 +36,8 @@ #include "qemu-timer.h" #include "memory.h" #include "exec-memory.h" +#include "qemu-thread.h" +#include "qemu/reclaimer.h" #if defined(CONFIG_USER_ONLY) #include "qemu.h" #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -static MemoryRegionSection *phys_sections; -static unsigned phys_sections_nb, phys_sections_nb_alloc; static uint16_t phys_section_unassigned; static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; - -/* Simple allocator for PhysPageEntry nodes */ -static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; - #define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1) -/* This is a multi-level map on the physical address space. - The bottom level has pointers to MemoryRegionSections.
*/ -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; - +static QemuMutex cur_map_lock; +static PhysMap *cur_map; QemuMutex mem_map_lock; +static PhysMap *next_map; static void io_mem_init(void); static void memory_map_init(void); @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index) #if !defined(CONFIG_USER_ONLY) -static void phys_map_node_reserve(unsigned nodes) +static void phys_map_node_reserve(PhysMap *map, unsigned nodes) { -if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) { +if (map->phys_map_nodes_nb + nodes > map->phys_map_nodes_nb_alloc) { typedef PhysPageEntry Node[L2_SIZE]; -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16); -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc, - phys_map_nodes_nb + nodes); -phys_map_nodes = g_renew(Node, phys_map_nodes, - phys_map_nodes_nb_alloc); +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc * 2, +16); +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc, + map->phys_map_nodes_nb + nodes); +map->phys_map_nodes = g_renew(Node, map->phys_map_nodes, + map->phys_map_nodes_nb_alloc); } } -static uint16_t phys_map_node_alloc(void) +static uint16_t phys_map_node_alloc(PhysMap *map) { unsigned i; uint16_t ret; -ret = phys_map_nodes_nb++; +ret = map->phys_map_nodes_nb++; assert(ret != PHYS_MAP_NODE_NIL); -assert(ret != phys_map_nodes_nb_alloc); +assert(ret != map->phys_map_nodes_nb_alloc); for (i = 0; i < L2_SIZE; ++i) { -phys_map_nodes[ret][i].is_leaf = 0; -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; +map->phys_map_nodes[ret][i].is_leaf = 0; +map->phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; } return ret; } -static void phys_map_nodes_reset(void) -{ -phys_map_nodes_nb = 0; -} - - -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, -target_phys_addr_t *nb, uint16_t leaf, +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp, +target_phys_addr_t *index, +target_phys_addr_t *nb, +uint16_t leaf, int level) { PhysPageEntry *p; @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, target_phys_addr_t step = (target_phys_addr_t)1 << (level * L2_BITS); if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) { -lp->ptr = phys_map_node_alloc(); -p = phys_map_nodes[lp->ptr]; +lp->ptr = phys_map_node_alloc(map); +p = map->phys_map_nodes[lp->ptr]; if (level == 0) { for (i = 0; i < L2_SIZE; i++) { p[i].is_leaf = 1; @@ -434,7 +426,7 @@ static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, } } } else
[PATCH 10/15] memory: change tcg related code to using PhysMap
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Change tcg code to use PhysMap. This is separated from the prev patch for review purpose. Should be merged into prev one. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c | 27 +-- 1 files changed, 21 insertions(+), 6 deletions(-) diff --git a/exec.c b/exec.c index 97addb9..8d0dea5 100644 --- a/exec.c +++ b/exec.c @@ -1923,6 +1923,7 @@ target_phys_addr_t memory_region_section_get_iotlb(CPUArchState *env, { target_phys_addr_t iotlb; CPUWatchpoint *wp; +PhysMap *map = cur_map_get(); if (memory_region_is_ram(section-mr)) { /* Normal RAM. */ @@ -1940,7 +1941,7 @@ target_phys_addr_t memory_region_section_get_iotlb(CPUArchState *env, and avoid full address decoding in every device. We can't use the high bits of pd for this because IO_MEM_ROMD uses these as a ram address. */ -iotlb = section - phys_sections; +iotlb = section - map-phys_sections; iotlb += memory_region_section_addr(section, paddr); } @@ -1956,6 +1957,7 @@ target_phys_addr_t memory_region_section_get_iotlb(CPUArchState *env, } } } +physmap_put(map); return iotlb; } @@ -3185,7 +3187,12 @@ static uint16_t dummy_section(PhysMap *map, MemoryRegion *mr) MemoryRegion *iotlb_to_region(target_phys_addr_t index) { -return phys_sections[index ~TARGET_PAGE_MASK].mr; +MemoryRegion *ret; +PhysMap *map = cur_map_get(); + +ret = map-phys_sections[index ~TARGET_PAGE_MASK].mr; +physmap_put(map); +return ret; } static void io_mem_init(void) @@ -3946,13 +3953,14 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val) { uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = map-phys_sections[phys_section_rom]; } io_mem_write(section-mr, addr, val, 4); } else { @@ -3972,19 +3980,21 @@ 
void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val) } } } +physmap_put(map); } void stq_phys_notdirty(target_phys_addr_t addr, uint64_t val) { uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = map-phys_sections[phys_section_rom]; } #ifdef TARGET_WORDS_BIGENDIAN io_mem_write(section-mr, addr, val 32, 4); @@ -3999,6 +4009,7 @@ void stq_phys_notdirty(target_phys_addr_t addr, uint64_t val) + memory_region_section_addr(section, addr)); stq_p(ptr, val); } +physmap_put(map); } /* warning: addr must be aligned */ @@ -4008,12 +4019,13 @@ static inline void stl_phys_internal(target_phys_addr_t addr, uint32_t val, uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = map-phys_sections[phys_section_rom]; } #if defined(TARGET_WORDS_BIGENDIAN) if (endian == DEVICE_LITTLE_ENDIAN) { @@ -4050,6 +4062,7 @@ static inline void stl_phys_internal(target_phys_addr_t addr, uint32_t val, (0xff ~CODE_DIRTY_FLAG)); } } +physmap_put(map); } void stl_phys(target_phys_addr_t addr, uint32_t val) @@ -4081,12 +4094,13 @@ static inline void stw_phys_internal(target_phys_addr_t addr, uint32_t val, uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = 
map-phys_sections[phys_section_rom]; } #if defined(TARGET_WORDS_BIGENDIAN) if (endian == DEVICE_LITTLE_ENDIAN) { @@ -4123,6 +4137,7 @@ static inline void stw_phys_internal(target_phys_addr_t addr, uint32_t val, (0xff ~CODE_DIRTY_FLAG)); } } +physmap_put(map); } void
[PATCH 11/15] lock: introduce global lock for device tree
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- cpus.c | 12 main-loop.h |3 +++ 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/cpus.c b/cpus.c index b182b3d..a734b36 100644 --- a/cpus.c +++ b/cpus.c @@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void) } #endif /* _WIN32 */ +QemuMutex qemu_device_tree_mutex; QemuMutex qemu_global_mutex; static QemuCond qemu_io_proceeded_cond; static bool iothread_requesting_mutex; @@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void) qemu_cond_init(qemu_work_cond); qemu_cond_init(qemu_io_proceeded_cond); qemu_mutex_init(qemu_global_mutex); +qemu_mutex_init(qemu_device_tree_mutex); qemu_thread_get_self(io_thread); } @@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void) qemu_mutex_unlock(qemu_global_mutex); } +void qemu_lock_devtree(void) +{ +qemu_mutex_lock(qemu_device_tree_mutex); +} + +void qemu_unlock_devtree(void) +{ +qemu_mutex_unlock(qemu_device_tree_mutex); +} + static int all_vcpus_paused(void) { CPUArchState *penv = first_cpu; diff --git a/main-loop.h b/main-loop.h index dce1cd9..17e959a 100644 --- a/main-loop.h +++ b/main-loop.h @@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void); */ void qemu_mutex_unlock_iothread(void); +void qemu_lock_devtree(void); +void qemu_unlock_devtree(void); + /* internal interfaces */ void qemu_fd_register(int fd); -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/15] qdev: using devtree lock to protect device's accessing
From: Liu Ping Fan pingf...@linux.vnet.ibm.com lock: qemu_device_tree_mutex competitors: --device_del(destruction of device will be postphoned until unplug ack from guest), --pci hot-unplug --iteration (qdev_reset_all) --device_add Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- hw/pci-hotplug.c |4 hw/qdev-monitor.c | 17 - hw/qdev.c |2 ++ 3 files changed, 22 insertions(+), 1 deletions(-) diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c index e7fb780..33a9dfe 100644 --- a/hw/pci-hotplug.c +++ b/hw/pci-hotplug.c @@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr) return -1; } +qemu_lock_devtree(); d = pci_find_device(pci_find_root_bus(dom), bus, PCI_DEVFN(slot, 0)); if (!d) { monitor_printf(mon, slot %d empty\n, slot); +qemu_unlock_devtree(); return -1; } @@ -275,9 +277,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr) if (error_is_set(local_err)) { monitor_printf(mon, %s\n, error_get_pretty(local_err)); error_free(local_err); +qemu_unlock_devtree(); return -1; } +qemu_unlock_devtree(); return 0; } diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c index 7915b45..2d47fe0 100644 --- a/hw/qdev-monitor.c +++ b/hw/qdev-monitor.c @@ -429,14 +429,18 @@ DeviceState *qdev_device_add(QemuOpts *opts) /* find bus */ path = qemu_opt_get(opts, bus); + +qemu_lock_devtree(); if (path != NULL) { bus = qbus_find(path); if (!bus) { +qemu_unlock_devtree(); return NULL; } if (strcmp(object_get_typename(OBJECT(bus)), k-bus_type) != 0) { qerror_report(QERR_BAD_BUS_FOR_DEVICE, driver, object_get_typename(OBJECT(bus))); +qemu_unlock_devtree(); return NULL; } } else { @@ -444,11 +448,13 @@ DeviceState *qdev_device_add(QemuOpts *opts) if (!bus) { qerror_report(QERR_NO_BUS_FOR_DEVICE, driver, k-bus_type); +qemu_unlock_devtree(); return NULL; } } if (qdev_hotplug !bus-allow_hotplug) { qerror_report(QERR_BUS_NO_HOTPLUG, bus-name); +qemu_unlock_devtree(); return NULL; } @@ -466,6 +472,7 @@ DeviceState 
*qdev_device_add(QemuOpts *opts) } if (qemu_opt_foreach(opts, set_property, qdev, 1) != 0) { qdev_free(qdev); +qemu_unlock_devtree(); return NULL; } if (qdev-id) { @@ -478,6 +485,8 @@ DeviceState *qdev_device_add(QemuOpts *opts) OBJECT(qdev), NULL); g_free(name); } +qemu_unlock_devtree(); + if (qdev_init(qdev) 0) { qerror_report(QERR_DEVICE_INIT_FAILED, driver); return NULL; @@ -600,13 +609,19 @@ void qmp_device_del(const char *id, Error **errp) { DeviceState *dev; +/* protect against unplug ack from guest, where we really remove device + * from system + */ +qemu_lock_devtree(); dev = qdev_find_recursive(sysbus_get_default(), id); if (NULL == dev) { error_set(errp, QERR_DEVICE_NOT_FOUND, id); +qemu_unlock_devtree(); return; } - +/* Just remove from system, and drop refcnt there*/ qdev_unplug(dev, errp); +qemu_unlock_devtree(); } void qdev_machine_init(void) diff --git a/hw/qdev.c b/hw/qdev.c index af54467..17525fe 100644 --- a/hw/qdev.c +++ b/hw/qdev.c @@ -230,7 +230,9 @@ static int qbus_reset_one(BusState *bus, void *opaque) void qdev_reset_all(DeviceState *dev) { +qemu_lock_devtree(); qdev_walk_children(dev, qdev_reset_one, qbus_reset_one, NULL); +qemu_unlock_devtree(); } void qbus_reset_all_fn(void *opaque) -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views
From: Liu Ping Fan pingf...@linux.vnet.ibm.com When the guest confirms the removal of a device, we should: --unmap it from the MemoryRegion view --isolate it from the device tree view Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- hw/acpi_piix4.c |4 ++-- hw/pci.c| 13 - hw/pci.h|2 ++ hw/qdev.c | 28 hw/qdev.h |3 ++- 5 files changed, 46 insertions(+), 4 deletions(-) diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index 0aace60..c209ff7 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -305,8 +305,8 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots) if (pc->no_hotplug) { slot_free = false; } else { -object_unparent(OBJECT(dev)); -qdev_free(qdev); +/* refcnt will be decreased */ +qdev_unplug_complete(qdev, NULL); } } } diff --git a/hw/pci.c b/hw/pci.c index 99a4304..2095abf 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -856,12 +856,22 @@ static int pci_unregister_device(DeviceState *dev) if (ret) return ret; -pci_unregister_io_regions(pci_dev); pci_del_option_rom(pci_dev); do_pci_unregister_device(pci_dev); return 0; } +static void pci_unmap_device(DeviceState *dev) +{ +PCIDevice *pci_dev = PCI_DEVICE(dev); +PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev); + +pci_unregister_io_regions(pci_dev); +if (pc->unmap) { +pc->unmap(pci_dev); +} +} + void pci_register_bar(PCIDevice *pci_dev, int region_num, uint8_t type, MemoryRegion *memory) { @@ -2022,6 +2032,7 @@ static void pci_device_class_init(ObjectClass *klass, void *data) DeviceClass *k = DEVICE_CLASS(klass); k->init = pci_qdev_init; k->unplug = pci_unplug_device; +k->unmap = pci_unmap_device; k->exit = pci_unregister_device; k->bus_type = TYPE_PCI_BUS; k->props = pci_props; diff --git a/hw/pci.h b/hw/pci.h index 79d38fd..1c5b909 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -145,6 +145,8 @@ typedef struct PCIDeviceClass { DeviceClass parent_class; int (*init)(PCIDevice *dev); +void (*unmap)(PCIDevice *dev); + PCIUnregisterFunc *exit; PCIConfigReadFunc *config_read; PCIConfigWriteFunc *config_write; diff --git a/hw/qdev.c b/hw/qdev.c index 17525fe..530eabe 100644 --- a/hw/qdev.c +++ b/hw/qdev.c @@ -104,6 +104,14 @@ void qdev_set_parent_bus(DeviceState *dev, BusState *bus) bus_add_child(bus, dev); } +static void qdev_unset_parent(DeviceState *dev) +{ +BusState *b = dev->parent_bus; + +object_unparent(OBJECT(dev)); +bus_remove_child(b, dev); +} + /* Create a new device. This only initializes the device state structure and allows properties to be set. qdev_init should be called to initialize the actual device emulation. */ @@ -194,6 +202,26 @@ void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id, dev->alias_required_for_version = required_for_version; } +static int qdev_unmap(DeviceState *dev) +{ +DeviceClass *dc = DEVICE_GET_CLASS(dev); +if (dc->unmap) { +dc->unmap(dev); +} +return 0; +} + +void qdev_unplug_complete(DeviceState *dev, Error **errp) +{ +/* isolate from mem view */ +qdev_unmap(dev); +qemu_lock_devtree(); +/* isolate from device tree */ +qdev_unset_parent(dev); +qemu_unlock_devtree(); +object_unref(OBJECT(dev)); +} + void qdev_unplug(DeviceState *dev, Error **errp) { DeviceClass *dc = DEVICE_GET_CLASS(dev); diff --git a/hw/qdev.h b/hw/qdev.h index f4683dc..705635a 100644 --- a/hw/qdev.h +++ b/hw/qdev.h @@ -47,7 +47,7 @@ typedef struct DeviceClass { /* callbacks */ void (*reset)(DeviceState *dev); - +void (*unmap)(DeviceState *dev); /* device state */ const VMStateDescription *vmsd; @@ -162,6 +162,7 @@ void qdev_init_nofail(DeviceState *dev); void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id, int required_for_version); void qdev_unplug(DeviceState *dev, Error **errp); +void qdev_unplug_complete(DeviceState *dev, Error **errp); void qdev_free(DeviceState *dev); int qdev_simple_unplug_cb(DeviceState *dev); void qdev_machine_creation_done(void); -- 1.7.4.4
[PATCH 14/15] qom: object_unref call reclaimer
From: Liu Ping Fan pingf...@linux.vnet.ibm.com iohandler/bh/timer may still use a DeviceState after its refcnt has dropped to 0; postpone reclamation until they are done with it. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- qom/object.c |9 - 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/qom/object.c b/qom/object.c index 822bdb7..1452b1b 100644 --- a/qom/object.c +++ b/qom/object.c @@ -23,6 +23,8 @@ #include "qbool.h" #include "qint.h" #include "qstring.h" +#include "hw/qdev.h" +#include "qemu/reclaimer.h" #define MAX_INTERFACES 32 @@ -646,7 +648,12 @@ void object_unref(Object *obj) { g_assert(atomic_read(&obj->ref) > 0); if (atomic_dec_and_test(&obj->ref)) { -object_finalize(obj); +/* fixme, maybe introduce obj->finalize to make this more elegant */ +if (object_dynamic_cast(obj, TYPE_DEVICE) != NULL) { +qemu_reclaimer_enqueue(obj, object_finalize); +} else { +object_finalize(obj); +} } } -- 1.7.4.4
[PATCH 15/15] e1000: using new interface--unmap to unplug
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- hw/e1000.c |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/hw/e1000.c b/hw/e1000.c index 4573f13..fa71455 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -1192,6 +1192,13 @@ e1000_cleanup(VLANClientState *nc) s->nic = NULL; } +static void +pci_e1000_unmap(PCIDevice *p) +{ +/* DO NOT FREE anything! until refcnt=0 */ +/* isolate from memory view */ +} + static int pci_e1000_uninit(PCIDevice *dev) { @@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void *data) PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); k->init = pci_e1000_init; +k->unmap = pci_e1000_unmap; k->exit = pci_e1000_uninit; k->romfile = "pxe-e1000.rom"; k->vendor_id = PCI_VENDOR_ID_INTEL; -- 1.7.4.4
Re: [PATCH 6/8] KVM: PPC: E500: Implement MMU notifiers
On Tue, Aug 07, 2012 at 12:57:14PM +0200, Alexander Graf wrote: The e500 target has lived without mmu notifiers ever since it got introduced, but fails for the user space check on them with hugetlbfs. Ironically that user space check isn't necessary any more since David Gibson's fix for the hugetlbfs bug went in (90481622, hugepages: fix use after free bug in quota handling). So on sufficiently recent kernels you can just remove the userspace check. Implementing mmu-notifiers is a good thing for other reasons though. Paul.
Re: [PATCH 8/8] KVM: PPC: Add cache flush on page map
On 07.08.2012, at 23:01, Scott Wood scottw...@freescale.com wrote: On 08/07/2012 05:57 AM, Alexander Graf wrote: When we map a page that wasn't icache cleared before, do so when first mapping it in KVM using the same information bits as the Linux mapping logic. That way we are 100% sure that any page we map does not have stale entries in the icache. What we really would need is a method to flush the icache only when we actually map a page executable for the guest. But with the current abstraction that is hard to do, and this way we don't have a huge performance penalty, so better be safe now and micro optimize things later. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/mm/mem.c |1 + 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index ed75bc9..c0a2fc1 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -33,6 +33,7 @@ #include asm/kvm_asm.h #include asm/processor.h #include asm/page.h +#include asm/cacheflush.h #define KVM_MAX_VCPUSNR_CPUS #define KVM_MAX_VCORESNR_CPUS @@ -550,5 +551,14 @@ struct kvm_vcpu_arch { #define KVM_MMIO_REG_FQPR0x0060 #define __KVM_HAVE_ARCH_WQP +#define __KVM_HAVE_ARCH_MAP_PAGE +static inline void kvm_arch_map_page(struct page *page) +{ +/* Need to invalidate the icache for new pages */ +if (!test_bit(PG_arch_1, page-flags)) { +flush_dcache_icache_page(page); +set_bit(PG_arch_1, page-flags); +} +} Shouldn't this test CPU_FTR_COHERENT_ICACHE? flush_icache_range checks for it a bit further down the code stream. I mostly modeled things the same way as set_pre_filter. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] KVM: PPC: E500: Implement MMU notifiers
On 08.08.2012, at 05:31, Paul Mackerras pau...@samba.org wrote: On Tue, Aug 07, 2012 at 12:57:14PM +0200, Alexander Graf wrote: The e500 target has lived without mmu notifiers ever since it got introduced, but fails for the user space check on them with hugetlbfs. Ironically that user space check isn't necessary any more since David Gibson's fix for the hugetlbfs bug went in (90481622, hugepages: fix use after free bug in quota handling). So on sufficiently recent kernels you can just remove the userspace check. Implementing mmu-notifiers is a good thing for other reasons though. Yeah, it's probably best to just require mmu notifiers from every target. It would however be good to have a CAP nevertheless (or change the semantics of SYNC_MMU) so that we can run HV KVM on 970 without changes to QEMU. Alex
[PATCH V7 0/2] Improve virtio-blk performance
Hi, all Changes in v7: - Using vbr->flags to trace request type - Dropped unnecessary struct virtio_blk *vblk parameter - Reuse struct virtblk_req in bio done function - Added performance data on a normal SATA device and the reason why we make it optional Fio test shows bio-based IO path gives the following performance improvement: 1) Ramdisk device With bio-based IO path, sequential read/write, random read/write IOPS boost : 28%, 24%, 21%, 16% Latency improvement: 32%, 17%, 21%, 16% 2) Fusion IO device With bio-based IO path, sequential read/write, random read/write IOPS boost : 11%, 11%, 13%, 10% Latency improvement: 10%, 10%, 12%, 10% 3) Normal SATA device With bio-based IO path, sequential read/write, random read/write IOPS boost : -10%, -10%, 4.4%, 0.5% Latency improvement: -12%, -15%, 2.5%, 0.8% Asias He (2): virtio-blk: Add bio-based IO path for virtio-blk virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path drivers/block/virtio_blk.c | 301 +++-- 1 file changed, 264 insertions(+), 37 deletions(-) -- 1.7.11.2
[PATCH V7 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
We need to support both REQ_FLUSH and REQ_FUA in the bio-based path, since it does not get the sequencing of REQ_FUA into REQ_FLUSH that request-based drivers get from the block layer. REQ_FLUSH is emulated by: A) If the bio has no data to write: 1. Send VIRTIO_BLK_T_FLUSH to device, 2. In the flush I/O completion handler, finish the bio B) If the bio has data to write: 1. Send VIRTIO_BLK_T_FLUSH to device 2. In the flush I/O completion handler, send the actual write data to device 3. In the write I/O completion handler, finish the bio REQ_FUA is emulated by: 1. Send the actual write data to device 2. In the write I/O completion handler, send VIRTIO_BLK_T_FLUSH to device 3. In the flush I/O completion handler, finish the bio Changes in v7: - Using vbr->flags to trace request type - Dropped unnecessary struct virtio_blk *vblk parameter - Reuse struct virtblk_req in bio done function Changes in v6: - Reworked REQ_FLUSH and REQ_FUA emulation order Cc: Rusty Russell ru...@rustcorp.com.au Cc: Jens Axboe ax...@kernel.dk Cc: Christoph Hellwig h...@lst.de Cc: Tejun Heo t...@kernel.org Cc: Shaohua Li s...@kernel.org Cc: Michael S.
Tsirkin m...@redhat.com Cc: kvm@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: virtualizat...@lists.linux-foundation.org Signed-off-by: Asias He as...@redhat.com --- drivers/block/virtio_blk.c | 272 +++-- 1 file changed, 188 insertions(+), 84 deletions(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 95cfeed..2edfb5c 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -58,10 +58,20 @@ struct virtblk_req struct bio *bio; struct virtio_blk_outhdr out_hdr; struct virtio_scsi_inhdr in_hdr; + struct work_struct work; + struct virtio_blk *vblk; + int flags; u8 status; struct scatterlist sg[]; }; +enum { + VBLK_IS_FLUSH = 1, + VBLK_REQ_FLUSH = 2, + VBLK_REQ_DATA = 4, + VBLK_REQ_FUA= 8, +}; + static inline int virtblk_result(struct virtblk_req *vbr) { switch (vbr-status) { @@ -74,9 +84,133 @@ static inline int virtblk_result(struct virtblk_req *vbr) } } -static inline void virtblk_request_done(struct virtio_blk *vblk, - struct virtblk_req *vbr) +static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk, + gfp_t gfp_mask) +{ + struct virtblk_req *vbr; + + vbr = mempool_alloc(vblk-pool, gfp_mask); + if (vbr use_bio) + sg_init_table(vbr-sg, vblk-sg_elems); + + vbr-vblk = vblk; + + return vbr; +} + +static void virtblk_add_buf_wait(struct virtio_blk *vblk, +struct virtblk_req *vbr, +unsigned long out, +unsigned long in) +{ + DEFINE_WAIT(wait); + + for (;;) { + prepare_to_wait_exclusive(vblk-queue_wait, wait, + TASK_UNINTERRUPTIBLE); + + spin_lock_irq(vblk-disk-queue-queue_lock); + if (virtqueue_add_buf(vblk-vq, vbr-sg, out, in, vbr, + GFP_ATOMIC) 0) { + spin_unlock_irq(vblk-disk-queue-queue_lock); + io_schedule(); + } else { + virtqueue_kick(vblk-vq); + spin_unlock_irq(vblk-disk-queue-queue_lock); + break; + } + + } + + finish_wait(vblk-queue_wait, wait); +} + +static inline void virtblk_add_req(struct virtblk_req *vbr, + unsigned int out, unsigned int in) +{ + struct virtio_blk *vblk = vbr-vblk; + + 
spin_lock_irq(vblk-disk-queue-queue_lock); + if (unlikely(virtqueue_add_buf(vblk-vq, vbr-sg, out, in, vbr, + GFP_ATOMIC) 0)) { + spin_unlock_irq(vblk-disk-queue-queue_lock); + virtblk_add_buf_wait(vblk, vbr, out, in); + return; + } + virtqueue_kick(vblk-vq); + spin_unlock_irq(vblk-disk-queue-queue_lock); +} + +static int virtblk_bio_send_flush(struct virtblk_req *vbr) +{ + unsigned int out = 0, in = 0; + + vbr-flags |= VBLK_IS_FLUSH; + vbr-out_hdr.type = VIRTIO_BLK_T_FLUSH; + vbr-out_hdr.sector = 0; + vbr-out_hdr.ioprio = 0; + sg_set_buf(vbr-sg[out++], vbr-out_hdr, sizeof(vbr-out_hdr)); + sg_set_buf(vbr-sg[out + in++], vbr-status, sizeof(vbr-status)); + + virtblk_add_req(vbr, out, in); + + return 0; +} + +static int virtblk_bio_send_data(struct virtblk_req *vbr) { + struct virtio_blk *vblk = vbr-vblk; + unsigned int num, out = 0, in = 0; + struct bio *bio = vbr-bio; + + vbr-flags = ~VBLK_IS_FLUSH; + vbr-out_hdr.type = 0;
[PATCH V7 1/2] virtio-blk: Add bio-based IO path for virtio-blk
This patch introduces bio-based IO path for virtio-blk. Compared to request-based IO path, bio-based IO path uses driver provided -make_request_fn() method to bypasses the IO scheduler. It handles the bio to device directly without allocating a request in block layer. This reduces the IO path in guest kernel to achieve high IOPS and lower latency. The downside is that guest can not use the IO scheduler to merge and sort requests. However, this is not a big problem if the backend disk in host side uses faster disk device. When the bio-based IO path is not enabled, virtio-blk still uses the original request-based IO path, no performance difference is observed. Using a slow device e.g. normal SATA disk, the bio-based IO path for sequential read and write are slower than req-based IO path due to lack of merge in guest kernel. So we make the bio-based path optional. Performance evaluation: - 1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using kvm tool. Short version: With bio-based IO path, sequential read/write, random read/write IOPS boost : 28%, 24%, 21%, 16% Latency improvement: 32%, 17%, 21%, 16% Long version: With bio-based IO path: seq-read : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec rand-write: io=3095.7MB, bw=96198KB/s, iops=192396 , runt= 32952msec clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30 clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35 clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39 clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45 cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954 cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137 cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648 cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375 With 
request-based IO path: seq-read : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49 clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15 clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27 clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68 cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622 cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959 cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082 cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985 2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using kvm tool. Short version: With bio-based IO path, sequential read/write, random read/write IOPS boost : 11%, 11%, 13%, 10% Latency improvement: 10%, 10%, 12%, 10% Long Version: With bio-based IO path: read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29 clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80 clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07 clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25 cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517 cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604 cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612 cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105 With request-based IO path: read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 
39647msec write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84 clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64 clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55 clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95 cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641 cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689 cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223 cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409 3) Fio test is performed in a 8 vcpu guest with normal
Re: KVM segfaults with 3.5 while installing ubuntu 12.04
On Wed, Aug 08, 2012 at 07:51:07AM +0200, Stefan Priebe wrote: Any news? Was this applied upstream? Kevin is ill. He has asked me to review and test patches in his absence. When he gets back later this week this will get picked up (and included in QEMU 1.2). Here is the tree, it includes this patch: https://github.com/stefanha/qemu/commits/for-kevin Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.
On Tue, 7 Aug 2012 21:00:59 + Blue Swirl blauwir...@gmail.com wrote:

diff --git a/hw/s390x/css.c b/hw/s390x/css.c
new file mode 100644
index 000..7941c44
--- /dev/null
+++ b/hw/s390x/css.c
@@ -0,0 +1,440 @@
+/*
+ * Channel subsystem base support.
+ *
+ * Copyright 2012 IBM Corp.
+ * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include "qemu-thread.h"
+#include "qemu-queue.h"
+#include "hw/qdev.h"
+#include "kvm.h"
+#include "cpu.h"
+#include "ioinst.h"
+#include "css.h"
+
+struct chp_info {

CamelCase, please.

OK.

+    uint8_t in_use;
+    uint8_t type;
+};
+
+static struct chp_info chpids[MAX_CSSID + 1][MAX_CHPID + 1];
+
+static css_subch_cb_func css_subch_cb;

Probably these can be put to a container structure which can be passed around.

Still trying to come up with a good model for that.

+    case CCW_CMD_SENSE_ID:
+    {
+        uint8_t sense_bytes[256];
+
+        /* Sense ID information is device specific. */
+        memcpy(sense_bytes, &sch->id, sizeof(sense_bytes));
+        if (check_len) {
+            if (ccw->count != sizeof(sense_bytes)) {
+                ret = -EINVAL;
+                break;
+            }
+        }
+        len = MIN(ccw->count, sizeof(sense_bytes));
+        /*
+         * Only indicate 0xff in the first sense byte if we actually
+         * have enough place to store at least bytes 0-3.
+         */
+        if (len >= 4) {
+            stb_phys(ccw->cda, 0xff);
+        } else {
+            stb_phys(ccw->cda, 0);
+        }
+        i = 1;
+        for (i = 1; i < len - 1; i++) {
+            stb_phys(ccw->cda + i, sense_bytes[i]);
+        }

cpu_physical_memory_write()

Hm, what's wrong with storing byte-by-byte?

+        sch->curr_status.scsw.count = ccw->count - len;
+        ret = 0;
+        break;
+    }
+    case CCW_CMD_TIC:
+        if (sch->last_cmd->cmd_code == CCW_CMD_TIC) {
+            ret = -EINVAL;
+            break;
+        }
+        if (ccw->flags & (CCW_FLAG_CC | CCW_FLAG_DC)) {
+            ret = -EINVAL;
+            break;
+        }
+        sch->channel_prog = qemu_get_ram_ptr(ccw->cda);
+        ret = sch->channel_prog ? -EAGAIN : -EFAULT;
+        break;
+    default:
+        if (sch->ccw_cb) {
+            /* Handle device specific commands. */
+            ret = sch->ccw_cb(sch, ccw);
+        } else {
+            ret = -EOPNOTSUPP;
+        }
+        break;
+    }
+    sch->last_cmd = ccw;
+    if (ret == 0) {
+        if (ccw->flags & CCW_FLAG_CC) {
+            sch->channel_prog += 8;
+            ret = -EAGAIN;
+        }
+    }
+
+    return ret;

diff --git a/hw/s390x/css.h b/hw/s390x/css.h
new file mode 100644
index 000..b8a95cc
--- /dev/null
+++ b/hw/s390x/css.h
@@ -0,0 +1,62 @@
+/*
+ * Channel subsystem structures and definitions.
+ *
+ * Copyright 2012 IBM Corp.
+ * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#ifndef CSS_H
+#define CSS_H
+
+#include "ioinst.h"
+
+/* Channel subsystem constants. */
+#define MAX_SCHID 65535
+#define MAX_SSID 3
+#define MAX_CSSID 254 /* 255 is reserved */
+#define MAX_CHPID 255
+
+#define MAX_CIWS 8
+
+struct senseid {

SenseID

OK.

+    /* common part */
+    uint32_t reserved:8;    /* always 0x'FF' */

The standard syntax calls for 'unsigned' instead of uint32_t for bit fields. But bit fields are not very well defined, it's better to avoid them.

Well, the equivalent Linux structure also looks like that :) But I can switch this to a uint8_t/uint16_t structure.

+    uint32_t cu_type:16;    /* control unit type */
+    uint32_t cu_model:8;    /* control unit model */
+    uint32_t dev_type:16;   /* device type */
+    uint32_t dev_model:8;   /* device model */
+    uint32_t unused:8;      /* padding byte */
+    /* extended part */
+    uint32_t ciw[MAX_CIWS]; /* variable # of CIWs */
+};
+

diff --git a/target-s390x/ioinst.h b/target-s390x/ioinst.h
new file mode 100644
index 000..79628b4
--- /dev/null
+++ b/target-s390x/ioinst.h
@@ -0,0 +1,173 @@
+/*
+ * S/390 channel I/O instructions
+ *
+ * Copyright 2012 IBM Corp.
+ * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#ifndef IOINST_S390X_H
+#define IOINST_S390X_H
+
+/*
+ * Channel I/O
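Cornelia's proposed alternative to the bit-field struct — fixed-width members instead of uint32_t bit fields — could look roughly like this. This is a sketch, not the actual follow-up patch; QEMU_PACKED is spelled out as the underlying gcc attribute, and the field names follow the quoted patch:

```c
/* Sketch of the senseid struct without bit fields, as suggested in the
 * review.  Fixed-width members plus packing give the same 8-byte common
 * part the bit-field version described. */
#include <stdint.h>

#define MAX_CIWS 8

struct SenseId {
    /* common part */
    uint8_t  reserved;        /* always 0x'FF' */
    uint16_t cu_type;         /* control unit type */
    uint8_t  cu_model;        /* control unit model */
    uint16_t dev_type;        /* device type */
    uint8_t  dev_model;       /* device model */
    uint8_t  unused;          /* padding byte */
    /* extended part */
    uint32_t ciw[MAX_CIWS];   /* variable # of CIWs */
} __attribute__((packed));
```

The packed attribute keeps the wire layout (8 bytes of common part followed by the CIWs), which a plain struct would not guarantee because of alignment padding before the uint16_t members.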
Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.
On 7 August 2012 15:52, Cornelia Huck cornelia.h...@de.ibm.com wrote:

+static void sch_handle_clear_func(SubchDev *sch)
+{
+    struct pmcw *p = &sch->curr_status.pmcw;
+    struct scsw *s = &sch->curr_status.scsw;
+    int path;
+
+    /* Path management: In our simple css, we always choose the only path. */
+    path = 0x80;
+
+    /* Reset values prior to 'issuing the clear signal'. */
+    p->lpum = 0;

This is unnecessary since we're going to set p->lpum to something else about ten lines later, right?

+    p->pom = 0xff;
+    s->pno = 0;
+
+    /* We always 'attempt to issue the clear signal', and we always succeed. */
+    sch->orb = NULL;
+    sch->channel_prog = NULL;
+    sch->last_cmd = NULL;
+    s->actl &= ~SCSW_ACTL_CLEAR_PEND;
+    s->stctl |= SCSW_STCTL_STATUS_PEND;
+
+    s->dstat = 0;
+    s->cstat = 0;
+    p->lpum = path;
+
+}

-- PMM
Re: [Qemu-devel] [PATCH 3/5] s390: Add new channel I/O based virtio transport.
On Tue, 7 Aug 2012 20:47:22 + Blue Swirl blauwir...@gmail.com wrote:

diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
new file mode 100644
index 000..8a90c3a
--- /dev/null
+++ b/hw/s390x/virtio-ccw.c
@@ -0,0 +1,962 @@
+/*
+ * virtio ccw target implementation
+ *
+ * Copyright 2012 IBM Corp.
+ * Author(s): Cornelia Huck cornelia.h...@de.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include "hw/hw.h"
+#include "block.h"
+#include "blockdev.h"
+#include "sysemu.h"
+#include "net.h"
+#include "monitor.h"
+#include "qemu-thread.h"
+#include "../virtio.h"
+#include "../virtio-serial.h"
+#include "../virtio-net.h"
+#include "../sysbus.h"

hw/virtio... for the above

OK.

+#include "bitops.h"
+
+#include "ioinst.h"
+#include "css.h"
+#include "virtio-ccw.h"
+
+static const TypeInfo virtio_ccw_bus_info = {
+    .name = TYPE_VIRTIO_CCW_BUS,
+    .parent = TYPE_BUS,
+    .instance_size = sizeof(VirtioCcwBus),
+};
+
+static const VirtIOBindings virtio_ccw_bindings;
+
+typedef struct sch_entry {
+    SubchDev *sch;
+    QLIST_ENTRY(sch_entry) entry;
+} sch_entry;

SubchEntry, see CODING_STYLE. Also other struct and typedef names below.

+
+QLIST_HEAD(subch_list, sch_entry);

static, but please put this to a structure that is passed around instead.

+
+typedef struct devno_entry {
+    uint16_t devno;
+    QLIST_ENTRY(devno_entry) entry;
+} devno_entry;
+
+QLIST_HEAD(devno_list, devno_entry);

Ditto

+
+struct subch_set {
+    struct subch_list *s_list[256];
+    struct devno_list *d_list[256];
+};
+
+struct css_set {
+    struct subch_set *set[MAX_SSID + 1];
+};
+
+static struct css_set *channel_subsys[MAX_CSSID + 1];

OK, will try to come up with some kind of structure for this and CamelCasify it.

+
+VirtIODevice *virtio_ccw_get_vdev(SubchDev *sch)
+{
+    VirtIODevice *vdev = NULL;
+
+    if (sch->driver_data) {
+        vdev = ((VirtioCcwData *)sch->driver_data)->vdev;
+    }
+    return vdev;
+}
+
+VirtioCcwBus *virtio_ccw_bus_init(void)
+{
+    VirtioCcwBus *bus;
+    BusState *_bus;

Please avoid identifiers with leading underscores.

OK.

+    DeviceState *dev;
+
+    css_set_subch_cb(virtio_ccw_find_subch);
+
+    /* Create bridge device */
+    dev = qdev_create(NULL, "virtio-ccw-bridge");
+    qdev_init_nofail(dev);
+
+    /* Create bus on bridge device */
+    _bus = qbus_create(TYPE_VIRTIO_CCW_BUS, dev, "virtio-ccw");
+    bus = DO_UPCAST(VirtioCcwBus, bus, _bus);
+
+    /* Enable hotplugging */
+    _bus->allow_hotplug = 1;
+
+    return bus;
+}
+
+struct vq_info_block {
+    uint64_t queue;
+    uint16_t num;
+} QEMU_PACKED;
+
+struct vq_config_block {
+    uint16_t index;
+    uint16_t num;
+} QEMU_PACKED;

Aren't these KVM structures? They should be defined in a KVM header file in linux-headers.

Not really, virtio-ccw isn't tied to kvm. I see this more as command blocks that are specific to the control unit - like something that would be defined in an attachment specification for a classic s390 device (and in the virtio spec in this case) and modeled as C structures here.

+    case CCW_CMD_WRITE_CONF:
+        if (check_len) {
+            if (ccw->count > data->vdev->config_len) {
+                ret = -EINVAL;
+                break;
+            }
+        }
+        len = MIN(ccw->count, data->vdev->config_len);
+        config = qemu_get_ram_ptr(ccw->cda);

Please use cpu_physical_memory_read() (or DMA versions) instead of this + memcpy().

Will check.

+        if (!config) {
+            ret = -EFAULT;
+        } else {
+            memcpy(data->vdev->config, config, len);
+            if (data->vdev->set_config) {
+                data->vdev->set_config(data->vdev, data->vdev->config);
+            }
+            sch->curr_status.scsw.count = ccw->count - len;
+            ret = 0;
+        }
+        break;
+    case CCW_CMD_READ_VQ_CONF:
+        if (check_len) {
+            if (ccw->count != sizeof(vq_config)) {
+                ret = -EINVAL;
+                break;
+            }
+        } else if (ccw->count < sizeof(vq_config)) {
+            /* Can't execute command. */
+            ret = -EINVAL;
+            break;
+        }
+        if (!qemu_get_ram_ptr(ccw->cda)) {
+            ret = -EFAULT;
+        } else {
+            vq_config.index = lduw_phys(ccw->cda);

lduw_{b,l}e_phys()

+            vq_config.num = virtio_queue_get_num(data->vdev, vq_config.index);
+            stw_phys(ccw->cda + sizeof(vq_config.index), vq_config.num);

stw_{b,l}e_phys(), likewise elsewhere.

Will check as well.

+            sch->curr_status.scsw.count = ccw->count -
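Blue Swirl's stw_{b,l}e_phys() comment is about making guest byte order explicit rather than relying on the target's default. The difference can be illustrated standalone; these helper names are ours, not QEMU's, and they write into a plain buffer instead of guest physical memory:

```c
/* Explicit-endian 16-bit stores, mirroring the intent of QEMU's
 * stw_be_phys()/stw_le_phys(): the byte layout is fixed by the helper,
 * not by the host or target endianness. */
#include <stdint.h>

static void stw_be(uint8_t *p, uint16_t v)
{
    p[0] = v >> 8;      /* big-endian: most significant byte first */
    p[1] = v & 0xff;
}

static void stw_le(uint8_t *p, uint16_t v)
{
    p[0] = v & 0xff;    /* little-endian: least significant byte first */
    p[1] = v >> 8;
}
```

For a big-endian guest like s390, the byte-order-neutral stw_phys() happens to do the right thing only by accident of the build configuration; the explicit variant documents the wire format.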
Re: KVM segfaults with 3.5 while installing ubuntu 12.04
ah OK - thanks. Will there be a fixed 1.1.2 as well?

Stefan

On 08.08.2012 10:06, Stefan Hajnoczi wrote:

On Wed, Aug 08, 2012 at 07:51:07AM +0200, Stefan Priebe wrote: Any news? Was this applied upstream?

Kevin is ill. He has asked me to review and test patches in his absence. When he gets back later this week this will get picked up (and included in QEMU 1.2). Here is the tree, it includes this patch: https://github.com/stefanha/qemu/commits/for-kevin

Stefan
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:

On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote: Peter has started to fix up this naming mess in qemu. I guess we should do the same for the kernel (except for ABIs) and document it, because it keeps generating confusion.

Ok so our current situation is that the XICS and MPIC are emulated in userspace entirely but the link between them and the VCPU is the asynchronous EE line so we are fine. I'm currently working on moving the XICS into the kernel for performance reasons, however, just like ARM VGIC, I can't seem to find a way to fit in the generic irqchip code in there. It's just not generic at all and quite x86 centric :-)

The generic code is for the two apic architectures: x64 and ia64. Don't try to shoehorn it in, you'll damage both the shoe and your foot.

So for now I'm just doing my own version of CREATE_IRQCHIP to create it and KVM_INTERRUPT to trigger the various interrupts. None of the mapping stuff (which we really don't need).

You mean KVM_IRQ_LINE. KVM_INTERRUPT is a synchronous vcpu ioctl.

That's a bit of a problem vs. some of the code qemu-side such as in virtio-pci which does seem to be written around the model exposed by the x86 stuff and relies on doing such mappings so I think we'll have to butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing x86 specific there.

-- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.
On Wed, 8 Aug 2012 09:27:32 +0100 Peter Maydell peter.mayd...@linaro.org wrote:

On 7 August 2012 15:52, Cornelia Huck cornelia.h...@de.ibm.com wrote:

+static void sch_handle_clear_func(SubchDev *sch)
+{
+    struct pmcw *p = &sch->curr_status.pmcw;
+    struct scsw *s = &sch->curr_status.scsw;
+    int path;
+
+    /* Path management: In our simple css, we always choose the only path. */
+    path = 0x80;
+
+    /* Reset values prior to 'issuing the clear signal'. */
+    p->lpum = 0;

This is unnecessary since we're going to set p->lpum to something else about ten lines later, right?

My intention here was following the description in the architecture closely. In our case, 'attempting to issue the clear signal' will always succeed, and so we will always set the lpum later on - on real hardware, this may be different.

+    p->pom = 0xff;
+    s->pno = 0;
+
+    /* We always 'attempt to issue the clear signal', and we always succeed. */
+    sch->orb = NULL;
+    sch->channel_prog = NULL;
+    sch->last_cmd = NULL;
+    s->actl &= ~SCSW_ACTL_CLEAR_PEND;
+    s->stctl |= SCSW_STCTL_STATUS_PEND;
+
+    s->dstat = 0;
+    s->cstat = 0;
+    p->lpum = path;
+
+}

-- PMM
Re: [PATCH 01/15] atomic: introduce atomic operations
On 08/08/2012 08:25, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

If out of global lock, we will be challenged by SMP in low level, so need atomic ops. This file is heavily copied from kernel.

Then it cannot be GPLv2 _or later_. Please use the version that I pointed you to.

Paolo

Currently, only x86 atomic ops included, and will be extended for other arch for future.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/atomic.h | 161 +
 1 files changed, 161 insertions(+), 0 deletions(-)
 create mode 100644 include/qemu/atomic.h

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
new file mode 100644
index 000..8e1fc3e
--- /dev/null
+++ b/include/qemu/atomic.h
@@ -0,0 +1,161 @@
+/*
+ * Simple interface for atomic operations.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On 08/08/2012 03:49 AM, David Gibson wrote:

We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 don't do it in what seems like the analogous situation. It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case.

Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86).

Uh.. overloaded in what way?

On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and the two PICs are emulated in the kernel. Now the IOAPIC and the PICs correspond to non-x86 interrupt controllers, but the local APIC is more tightly coupled to the core. Interrupt acceptance by the core is an operation that involves synchronous communication with the local APIC: the APIC presents the interrupt, the core accepts it based on the value of the interrupt enable flag and possibly a register (CR8), then the APIC updates the ISR and IRR.

The upshot is that if the local APIC is in userspace, interrupts must be synchronous with vcpu execution, so that KVM_INTERRUPT is a vcpu ioctl and HLT is emulated in userspace (so that local APIC emulation can check if an interrupt wakes it up or not).

Sorry, still not 100% getting it. When the vcpu is actually running code, that synchronous communication must still be accomplished via the KVM_INTERRUPT ioctl, yes? So what makes HLT different, that the communication can't be accomplished in that case.

No, you're correct. HLT could have been emulated in userspace, it just wasn't.
The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 01/15] atomic: introduce atomic operations
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

If out of global lock, we will be challenged by SMP in low level, so need atomic ops. This file is heavily copied from kernel. Currently, only x86 atomic ops included, and will be extended for other arch for future.

I propose we use gcc builtins. We get automatic architecture support, and tuning for newer processors if the user so chooses.

http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

In May 2031 we can switch to C11 atomics.

-- error compiling committee.c: too many arguments to function
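A minimal sketch of what the builtin-based approach Avi proposes could look like, using the documented __sync_* builtins in place of the patch's hand-written inline assembly. The Atomic type mirrors the patch under review; this is an illustration of the idea, not the code that eventually went into QEMU:

```c
/* Atomic counter built on gcc's legacy __sync_* builtins: the compiler
 * emits the right instruction sequence for whatever host architecture
 * it is targeting, so no per-arch assembly is needed. */
#include <stdint.h>

typedef struct Atomic {
    volatile int counter;
} Atomic;

static inline void atomic_set(Atomic *v, int i)
{
    v->counter = i;
}

static inline int atomic_read(Atomic *v)
{
    return v->counter;
}

static inline void atomic_add(int i, Atomic *v)
{
    __sync_fetch_and_add(&v->counter, i);
}

static inline void atomic_sub(int i, Atomic *v)
{
    __sync_fetch_and_sub(&v->counter, i);
}

/* Returns non-zero iff the counter dropped to zero. */
static inline int atomic_dec_and_test(Atomic *v)
{
    return __sync_sub_and_fetch(&v->counter, 1) == 0;
}
```

The builtins are full memory barriers, which matches the `lock`-prefixed x86 sequences the patch open-codes while remaining portable to every host gcc supports.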
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Collect unused object and release them at caller demand.

Please explain the motivation for this patch.

-- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations
I propose we use gcc builtins. We get automatic architecture support, and tuning for newer processors if the user so chooses. http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html In May 2031 we can switch to C11 atomics.

Maybe 2013?

-- Wei-Ren Chen (陳韋任) Computer Systems Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667 Homepage: http://people.cs.nctu.edu.tw/~chenwj
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
On 08/08/2012 11:05, Avi Kivity wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Collect unused object and release them at caller demand.

Please explain the motivation for this patch.

It's poor man's RCU, I think?

Paolo
Re: [PATCH v8] kvm: notify host when the guest is panicked
On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote:

diff --git a/Documentation/virtual/kvm/pv_event.txt b/Documentation/virtual/kvm/pv_event.txt
new file mode 100644
index 000..0ebc890
--- /dev/null
+++ b/Documentation/virtual/kvm/pv_event.txt
@@ -0,0 +1,32 @@
+The KVM paravirtual event interface
+===================================
+
+Initializing the paravirtual event interface
+============================================
+kvm_pv_event_init()
+Arguments:
+    None
+
+Return Value:
+    0 : The guest kernel can't use paravirtual event interface.
+    -1: The guest kernel can use paravirtual event interface.
+

This documentation has the can and can't backwards.
Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Using mem_map_lock to protect among updaters. So we can get the intact snapshot of mem topology -- FlatView & radix-tree.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |  3 +++
 memory.c | 22 ++
 memory.h |  2 ++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 8244d54..0e29ef9 100644
--- a/exec.c
+++ b/exec.c
@@ -210,6 +210,8 @@
    The bottom level has pointers to MemoryRegionSections. */
 static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
 
+QemuMutex mem_map_lock;
+
 static void io_mem_init(void);
 static void memory_map_init(void);
 
@@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
 #if !defined(CONFIG_USER_ONLY)
     memory_map_init();
     io_mem_init();
+    qemu_mutex_init(&mem_map_lock);
 #endif
 }

diff --git a/memory.c b/memory.c
index aab4a31..5986532 100644
--- a/memory.c
+++ b/memory.c
@@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
     assert(memory_region_transaction_depth);
     --memory_region_transaction_depth;
     if (!memory_region_transaction_depth && memory_region_update_pending) {
+        qemu_mutex_lock(&mem_map_lock);
         memory_region_update_topology(NULL);
+        qemu_mutex_unlock(&mem_map_lock);
     }
 }

Seems to me that nothing in memory.c can be susceptible to races. It must already be called under the big qemu lock, and with the exception of mutators (memory_region_set_*), changes aren't directly visible.

I think it's sufficient to take the mem_map_lock at the beginning of core_begin() and drop it at the end of core_commit(). That means all updates of volatile state, phys_map, are protected.

-- error compiling committee.c: too many arguments to function
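The begin/commit suggestion can be sketched standalone: hold the lock across the whole update window instead of around each topology rebuild, so every intermediate update of the volatile state is covered. A pthread mutex stands in for QemuMutex here, and the generation counter is an illustrative stand-in for the phys_map radix tree:

```c
/* Sketch of lock-across-the-transaction: core_begin() takes the lock,
 * all updates happen while it is held, core_commit() publishes and
 * releases.  Readers taking the same lock always see a complete map. */
#include <pthread.h>

static pthread_mutex_t mem_map_lock = PTHREAD_MUTEX_INITIALIZER;
static int phys_map_generation;   /* stand-in for the phys_map state */

static void core_begin(void)
{
    pthread_mutex_lock(&mem_map_lock);
}

static void core_commit(void)
{
    /* every mutation between begin and commit ran under the lock */
    phys_map_generation++;
    pthread_mutex_unlock(&mem_map_lock);
}
```

Compared with locking only around memory_region_update_topology(), this closes the window in which a reader could observe a half-applied series of region updates.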
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
On 08/08/2012 12:07 PM, Paolo Bonzini wrote:

On 08/08/2012 11:05, Avi Kivity wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Collect unused object and release them at caller demand. Please explain the motivation for this patch.

It's poor man's RCU, I think?

I thought that it was to defer destructors (finalizers) to a more suitable context. But why is the unref context unsuitable? I don't see how it relates to RCU, where is the C and the U?

Anyway the list eagerly awaits the explanation.

-- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations
On 08/08/2012 12:05 PM, 陳韋任 (Wei-Ren Chen) wrote:

I propose we use gcc builtins. We get automatic architecture support, and tuning for newer processors if the user so chooses. http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html In May 2031 we can switch to C11 atomics.

Maybe 2013?

Maybe in 2013 we'll get a compiler that supports C11...

-- error compiling committee.c: too many arguments to function
Re: [PATCH 05/15] memory: introduce life_ops to MemoryRegion
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

The types of referred object by MemoryRegion are variable, ex, another mr, DeviceState, or other struct defined by drivers. So the refer/unrefer may be different by drivers. Using this ops, we can manage the backend object.

Seems to be a needless abstraction - we already have lifetime management for objects. I suggested previously to replace the opaque parameter with an Object, and use Object's refcounting. That's a lot of work, but IMO is worth it as the opaques are dangerous to leave lying around.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Using refcnt for mr, so we can separate mr's life cycle management from the referred object. When mr->ref goes 0->1, inc the referred object. When mr->ref goes 1->0, dec the referred object. The referred object can be DeviceState, another mr, or other opaque.

Please explain the motivation more fully. Usually a MemoryRegion will be embedded within some DeviceState, or its lifecycle will be managed by the DeviceState. So long as we keep the DeviceState alive all associated MemoryRegions should be alive as well. Why not do this directly?

-- error compiling committee.c: too many arguments to function
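The 0->1 / 1->0 rule quoted above can be illustrated with a toy refcount pair: the region-like object only pins its referred parent (DeviceState, another mr, ...) while its own count is non-zero. All names here are hypothetical stand-ins, not the patch's API:

```c
/* Toy model of the proposed scheme: taking the first reference on a
 * child takes one reference on its parent; dropping the last one
 * releases the parent reference again. */
#include <assert.h>

typedef struct Obj {
    int ref;
    struct Obj *parent;       /* the "referred object", may be NULL */
} Obj;

static void obj_ref(Obj *o)
{
    if (o->ref++ == 0 && o->parent) {
        obj_ref(o->parent);   /* 0 -> 1: inc the referred object */
    }
}

static void obj_unref(Obj *o)
{
    assert(o->ref > 0);
    if (--o->ref == 0 && o->parent) {
        obj_unref(o->parent); /* 1 -> 0: dec the referred object */
    }
}
```

Avi's counterpoint amounts to skipping this indirection entirely: if the MemoryRegion's lifetime is already bounded by its embedding DeviceState, referencing the DeviceState directly gives the same guarantee with one less mechanism.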
Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations
On 8 August 2012 07:25, Liu Ping Fan qemul...@gmail.com wrote:

+static inline void atomic_sub(int i, Atomic *v)
+{
+    asm volatile("lock; subl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}

NAK. We don't want random inline assembly implementations of locking primitives in QEMU, they are way too hard to keep working with all the possible host architectures we support. I spent some time a while back getting rid of the (variously busted) versions we had previously.

If you absolutely must use atomic ops, use the gcc builtins. For preference, stick to higher level and less error-prone abstractions.

-- PMM
Re: [PATCH v8] kvm: notify host when the guest is panicked
At 08/08/2012 05:12 PM, Andrew Jones Wrote:

On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote:

diff --git a/Documentation/virtual/kvm/pv_event.txt b/Documentation/virtual/kvm/pv_event.txt
new file mode 100644
index 000..0ebc890
--- /dev/null
+++ b/Documentation/virtual/kvm/pv_event.txt
@@ -0,0 +1,32 @@
+The KVM paravirtual event interface
+===================================
+
+Initializing the paravirtual event interface
+============================================
+kvm_pv_event_init()
+Arguments:
+    None
+
+Return Value:
+    0 : The guest kernel can't use paravirtual event interface.
+    -1: The guest kernel can use paravirtual event interface.
+

This documentation has the can and can't backwards.

Yes, I will fix it.

Thanks
Wen Congyang
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:

So for now I'm just doing my own version of CREATE_IRQCHIP to create it and KVM_INTERRUPT to trigger the various interrupts. None of the mapping stuff (which we really don't need).

You mean KVM_IRQ_LINE. KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

That's a bit of a problem vs. some of the code qemu-side such as in virtio-pci which does seem to be written around the model exposed by the x86 stuff and relies on doing such mappings so I think we'll have to butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing x86 specific there.

The kvm_irqchip_add_msi_route and co is a problem, it more/less dives into the x86/ia64 specific routing stuff at a generic level, I'll have to hook things up differently in qemu I suspect. Not a huge deal, we can sort it out.

Cheers,
Ben.
Re: [PATCH 08/15] memory: introduce PhysMap to present snapshot of topology
On 08/08/2012 09:25 AM, Liu Ping Fan wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

PhysMap contains the flatview and radix-tree view; they are snapshots of system topology and should be consistent. With PhysMap, we can swap the pointer when updating and achieve atomic update.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |  8
 memory.c | 33 -
 memory.h | 62 --
 3 files changed, 60 insertions(+), 43 deletions(-)

diff --git a/exec.c b/exec.c
index 0e29ef9..01b91b0 100644
--- a/exec.c
+++ b/exec.c
@@ -156,8 +156,6 @@ typedef struct PageDesc {
 #endif
 
 /* Size of the L2 (and L3, etc) page tables. */
-#define L2_BITS 10
-#define L2_SIZE (1 << L2_BITS)
 #define P_L2_LEVELS \
     (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
 
@@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
 static void *l1_map[V_L1_SIZE];
 
 #if !defined(CONFIG_USER_ONLY)
-typedef struct PhysPageEntry PhysPageEntry;

This (and the other stuff you're moving) is private memory internals. It should be moved to memory-internals.h (currently named exec-obsolete.h).

-- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 12/15] qdev: using devtree lock to protect device's accessing
On 8 August 2012 07:25, Liu Ping Fan qemul...@gmail.com wrote:

From: Liu Ping Fan pingf...@linux.vnet.ibm.com

lock: qemu_device_tree_mutex

Looking at where it's used, this doesn't seem to have anything to do with device trees (ie dtb, see www.devicetree.org): poorly named lock?

-- PMM
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
Il 08/08/2012 08:25, Liu Ping Fan ha scritto: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Collect unused object and release them at caller demand. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/qemu/reclaimer.h | 28 ++ main-loop.c |5 qemu-tool.c |5 qom/Makefile.objs|2 +- qom/reclaimer.c | 58 ++ 5 files changed, 97 insertions(+), 1 deletions(-) create mode 100644 include/qemu/reclaimer.h create mode 100644 qom/reclaimer.c diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h new file mode 100644 index 000..9307e93 --- /dev/null +++ b/include/qemu/reclaimer.h @@ -0,0 +1,28 @@ +/* + * QEMU reclaimer + * + * Copyright IBM, Corp. 2012 + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#ifndef QEMU_RECLAIMER +#define QEMU_RECLAIMER + +typedef void ReleaseHandler(void *opaque); +typedef struct Chunk { +QLIST_ENTRY(Chunk) list; +void *opaque; +ReleaseHandler *release; +} Chunk; + +typedef struct ChunkHead { +struct Chunk *lh_first; +} ChunkHead; + +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release); +void reclaimer_worker(ChunkHead *head); +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release); +void qemu_reclaimer(void); So enqueue is call_rcu and qemu_reclaimer marks a quiescent state + empties the pending call_rcu. But what's the difference between the two pairs of APIs? +#endif diff --git a/main-loop.c b/main-loop.c index eb3b6e6..be9d095 100644 --- a/main-loop.c +++ b/main-loop.c @@ -26,6 +26,7 @@ #include qemu-timer.h #include slirp/slirp.h #include main-loop.h +#include qemu/reclaimer.h #ifndef _WIN32 @@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking) them. */ qemu_bh_poll(); +/* ref to device from iohandler/bh/timer do not obey the rules, so delay + * reclaiming until now. 
+ */
+    qemu_reclaimer();
 return ret;
 }

diff --git a/qemu-tool.c b/qemu-tool.c
index 318c5fc..f5fe319 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -21,6 +21,7 @@
 #include "main-loop.h"
 #include "qemu_socket.h"
 #include "slirp/libslirp.h"
+#include "qemu/reclaimer.h"

 #include <sys/time.h>

@@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
 {
 }

+void qemu_reclaimer(void)
+{
+}
+
 int use_icount;

 void qemu_clock_warp(QEMUClock *clock)

diff --git a/qom/Makefile.objs b/qom/Makefile.objs
index 5ef060a..a579261 100644
--- a/qom/Makefile.objs
+++ b/qom/Makefile.objs
@@ -1,4 +1,4 @@
-qom-obj-y = object.o container.o qom-qobject.o
+qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
 qom-obj-twice-y = cpu.o
 common-obj-y = $(qom-obj-twice-y)
 user-obj-y = $(qom-obj-twice-y)

diff --git a/qom/reclaimer.c b/qom/reclaimer.c
new file mode 100644
index 000..6cb53e3
--- /dev/null
+++ b/qom/reclaimer.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "qemu-thread.h"
+#include "main-loop.h"
+#include "qemu-queue.h"
+#include "qemu/reclaimer.h"
+
+static struct QemuMutex reclaimer_lock;
+static QLIST_HEAD(rcl, Chunk) reclaimer_list;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
+{
+    Chunk *r = g_malloc0(sizeof(Chunk));
+    r->opaque = opaque;
+    r->release = release;
+    QLIST_INSERT_HEAD_RCU(head, r, list);
+}

No lock?

+void reclaimer_worker(ChunkHead *head)
+{
+    Chunk *cur, *next;
+
+    QLIST_FOREACH_SAFE(cur, head, list, next) {
+        QLIST_REMOVE(cur, list);
+        cur->release(cur->opaque);
+        g_free(cur);
+    }

QLIST_REMOVE needs a lock too, so using the lockless QLIST_INSERT_HEAD_RCU is not necessary.
+}
+
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
+{
+    Chunk *r = g_malloc0(sizeof(Chunk));
+    r->opaque = opaque;
+    r->release = release;
+    qemu_mutex_lock(&reclaimer_lock);
+    QLIST_INSERT_HEAD_RCU(&reclaimer_list, r, list);
+    qemu_mutex_unlock(&reclaimer_lock);
+}
+
+
+void qemu_reclaimer(void)
+{
+    Chunk *cur, *next;
+
+    QLIST_FOREACH_SAFE(cur, &reclaimer_list, list, next) {
+        QLIST_REMOVE(cur, list);
+        cur->release(cur->opaque);
+        g_free(cur);
+    }

Same here.

+}
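To make the reviewer's locking point concrete, here is a hedged, standalone sketch of what the enqueue/drain pair could look like once both sides take the same lock. A pthread mutex and a hand-rolled singly linked list stand in for QemuMutex/QLIST; the function names mirror the patch but this is illustrative code, not QEMU's.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void ReleaseHandler(void *opaque);

typedef struct Chunk {
    struct Chunk *next;
    void *opaque;
    ReleaseHandler *release;
} Chunk;

static pthread_mutex_t reclaimer_lock = PTHREAD_MUTEX_INITIALIZER;
static Chunk *reclaimer_list;

/* Both enqueue and drain take reclaimer_lock, so there is no need for
 * the lockless QLIST_INSERT_HEAD_RCU variant. */
void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
{
    Chunk *r = calloc(1, sizeof(*r));
    r->opaque = opaque;
    r->release = release;
    pthread_mutex_lock(&reclaimer_lock);
    r->next = reclaimer_list;
    reclaimer_list = r;
    pthread_mutex_unlock(&reclaimer_lock);
}

/* Called at a quiescent point (end of main_loop_wait in the patch):
 * detach the whole list under the lock, then run handlers outside it. */
void qemu_reclaimer(void)
{
    pthread_mutex_lock(&reclaimer_lock);
    Chunk *cur = reclaimer_list;
    reclaimer_list = NULL;
    pthread_mutex_unlock(&reclaimer_lock);

    while (cur) {
        Chunk *next = cur->next;
        cur->release(cur->opaque);
        free(cur);
        cur = next;
    }
}

/* Demo state for exercising the queue. */
static int released;
static void count_release(void *opaque) { released += *(int *)opaque; }
```

Detaching the list before running the handlers also means a release handler may itself re-enqueue work without deadlocking on the mutex.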
Re: [PATCH 14/15] qom: object_unref call reclaimer
On 08/08/2012 08:25, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com

iohandler/bh/timer may use DeviceState when its refcnt=0, postpone the reclaimer till they have done with it.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 qom/object.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qom/object.c b/qom/object.c
index 822bdb7..1452b1b 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -23,6 +23,8 @@
 #include "qbool.h"
 #include "qint.h"
 #include "qstring.h"
+#include "hw/qdev.h"
+#include "qemu/reclaimer.h"

 #define MAX_INTERFACES 32

@@ -646,7 +648,12 @@ void object_unref(Object *obj)
 {
     g_assert(atomic_read(&obj->ref) > 0);
     if (atomic_dec_and_test(&obj->ref)) {
-        object_finalize(obj);
+        /* fixme, maybe introduce obj->finalze to make this more elegant */
+        if (object_dynamic_cast(obj, TYPE_DEVICE) != NULL) {
+            qemu_reclaimer_enqueue(obj, object_finalize);
+        } else {
+            object_finalize(obj);
+        }
     }
 }

Just do this unconditionally.

Paolo
Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Flatview and radix view are all under the protection of a pointer, and this makes sure that changes to them appear atomic! The mr accessed by a radix-tree leaf or flatview will be reclaimed after the prev PhysMap is no longer in use.

IMO this cleverness should come much later. Let's first take care of dropping the big qemu lock, then make switching memory maps more efficient. The initial paths could look like:

lookup:
    take mem_map_lock
    lookup
    take ref
    drop mem_map_lock

update:
    take mem_map_lock (in core_begin)
    do updates
    drop mem_map_lock

Later we can replace mem_map_lock with either a rwlock or (real) rcu.

#if !defined(CONFIG_USER_ONLY)
-static void phys_map_node_reserve(unsigned nodes)
+static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
 {
-    if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) {
+    if (map->phys_map_nodes_nb + nodes > map->phys_map_nodes_nb_alloc) {
         typedef PhysPageEntry Node[L2_SIZE];
-        phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
-        phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
-                                      phys_map_nodes_nb + nodes);
-        phys_map_nodes = g_renew(Node, phys_map_nodes,
-                                 phys_map_nodes_nb_alloc);
+        map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc * 2,
+                                           16);
+        map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc,
+                                           map->phys_map_nodes_nb + nodes);
+        map->phys_map_nodes = g_renew(Node, map->phys_map_nodes,
+                                      map->phys_map_nodes_nb_alloc);
     }
 }

Please have a patch that just adds the map parameter to all these functions. This makes the later patch, that adds the copy, easier to read.

+
+void cur_map_update(PhysMap *next)
+{
+    qemu_mutex_lock(&cur_map_lock);
+    physmap_put(cur_map);
+    cur_map = next;
+    smp_mb();
+    qemu_mutex_unlock(&cur_map_lock);
+}

IMO this can be mem_map_lock.
If we take my previous suggestion:

lookup:
    take mem_map_lock
    lookup
    take ref
    drop mem_map_lock

update:
    take mem_map_lock (in core_begin)
    do updates
    drop mem_map_lock

And update it to:

update:
    prepare next_map (in core_begin)
    do updates
    take mem_map_lock (in core_commit)
    switch maps
    drop mem_map_lock
    free old map

Note the lookup path copies the MemoryRegionSection instead of referencing it. Thus we can destroy the old map without worrying; the only pointers will point to MemoryRegions, which will be protected by the refcounts on their Objects.

This can be easily switched to rcu:

update:
    prepare next_map (in core_begin)
    do updates
    switch maps -> rcu_assign_pointer
    call_rcu(free old map) (or synchronize_rcu; free old maps)

Again, this should be done after the simplistic patch that enables parallel lookup but keeps just one map.

-- 
error compiling committee.c: too many arguments to function
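The two-phase update scheme described above can be sketched as a small, single-threaded model: readers take mem_map_lock only long enough to load the current map pointer and grab a reference, while the updater prepares a complete next map, switches pointers under the lock, and drops its own reference so the old map is freed only once the last reader is done. The PhysMap struct and the function names here are stand-ins, not the real QEMU code, and a real implementation would need atomic reference counting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct PhysMap {
    int refcount;          /* would need to be atomic in real code */
    /* ... nodes, sections ... */
} PhysMap;

static pthread_mutex_t mem_map_lock = PTHREAD_MUTEX_INITIALIZER;
static PhysMap *cur_map;

static void physmap_put(PhysMap *map)
{
    if (--map->refcount == 0) {
        free(map);
    }
}

/* Lookup side: load the pointer and take a reference under the lock;
 * the caller does its walk outside the lock and puts when done. */
PhysMap *physmap_get_current(void)
{
    pthread_mutex_lock(&mem_map_lock);
    PhysMap *map = cur_map;
    map->refcount++;
    pthread_mutex_unlock(&mem_map_lock);
    return map;
}

/* Commit side: the fully prepared next map is switched in under the
 * lock; the old map lives on until its last reader drops it. */
void cur_map_update(PhysMap *next)
{
    next->refcount = 1;                /* reference held by cur_map */
    pthread_mutex_lock(&mem_map_lock);
    PhysMap *old = cur_map;
    cur_map = next;
    pthread_mutex_unlock(&mem_map_lock);
    if (old) {
        physmap_put(old);              /* freed once last reader puts */
    }
}
```

Replacing the lock on the lookup side with rcu_read_lock()/rcu_dereference(), and the put of the old map with call_rcu(), gives the RCU variant described above without changing the structure.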
Re: [PATCH 11/15] lock: introduce global lock for device tree
On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Please explain the motivation. AFAICT, the big qemu lock is sufficient. -- error compiling committee.c: too many arguments to function
Re: [PATCH 11/15] lock: introduce global lock for device tree
On 08/08/2012 08:25, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 cpus.c | 12 
 main-loop.h |3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..a734b36 100644
--- a/cpus.c
+++ b/cpus.c
@@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
 }
 #endif /* _WIN32 */

+QemuMutex qemu_device_tree_mutex;
 QemuMutex qemu_global_mutex;
 static QemuCond qemu_io_proceeded_cond;
 static bool iothread_requesting_mutex;

@@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
     qemu_cond_init(&qemu_work_cond);
     qemu_cond_init(&qemu_io_proceeded_cond);
     qemu_mutex_init(&qemu_global_mutex);
+    qemu_mutex_init(&qemu_device_tree_mutex);
     qemu_thread_get_self(&io_thread);
 }

@@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
     qemu_mutex_unlock(&qemu_global_mutex);
 }

+void qemu_lock_devtree(void)
+{
+    qemu_mutex_lock(&qemu_device_tree_mutex);
+}
+
+void qemu_unlock_devtree(void)
+{
+    qemu_mutex_unlock(&qemu_device_tree_mutex);
+}

We don't need the wrappers. They are there for the big lock just because TCG needs extra work for iothread_requesting_mutex.

Paolo

 static int all_vcpus_paused(void)
 {
     CPUArchState *penv = first_cpu;

diff --git a/main-loop.h b/main-loop.h
index dce1cd9..17e959a 100644
--- a/main-loop.h
+++ b/main-loop.h
@@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void);
  */
 void qemu_mutex_unlock_iothread(void);

+void qemu_lock_devtree(void);
+void qemu_unlock_devtree(void);
+
 /* internal interfaces */

 void qemu_fd_register(int fd);
Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views
On 08/08/2012 08:25, Liu Ping Fan wrote:

+void qdev_unplug_complete(DeviceState *dev, Error **errp)
+{
+    /* isolate from mem view */
+    qdev_unmap(dev);
+    qemu_lock_devtree();
+    /* isolate from device tree */
+    qdev_unset_parent(dev);
+    qemu_unlock_devtree();
+    object_unref(OBJECT(dev));

Rather than deferring the free, you should defer the unref. Otherwise the following can happen when you have real RCU access to the memory map on the read-side:

    VCPU thread                      I/O thread
    =========================================================
    get MMIO request
    rcu_read_lock()
    walk memory map
                                     qdev_unmap()
                                     lock_devtree()
                                     ...
                                     unlock_devtree
                                     unref dev -> refcnt=0, free enqueued
    ref()
    rcu_read_unlock()
                                     free()
    dangling pointer!

If you defer the unref, you have instead:

    VCPU thread                      I/O thread
    =========================================================
    get MMIO request
    rcu_read_lock()
    walk memory map
                                     qdev_unmap()
                                     lock_devtree()
                                     ...
                                     unlock_devtree
                                     unref is enqueued
    ref() -> refcnt = 2
    rcu_read_unlock()
    unref() -> refcnt = 1
                                     unref() -> refcnt = 1

So this also makes patch 14 unnecessary.

Paolo

+}
Re: [PATCH 15/15] e1000: using new interface--unmap to unplug
On 08/08/2012 08:25, Liu Ping Fan wrote:

+static void
+pci_e1000_unmap(PCIDevice *p)
+{
+    /* DO NOT FREE anything!until refcnt=0 */
+    /* isolate from memory view */
+}

At least you need to call the superclass method.

Paolo

 static int
 pci_e1000_uninit(PCIDevice *dev)
 {
@@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void *data)
     PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

     k->init = pci_e1000_init;
+    k->unmap = pci_e1000_unmap;
     k->exit = pci_e1000_uninit;
     k->romfile = "pxe-e1000.rom";
     k->vendor_id = PCI_VENDOR_ID_INTEL;
Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views
On 08/08/2012 12:52 PM, Paolo Bonzini wrote: On 08/08/2012 08:25, Liu Ping Fan wrote:

+void qdev_unplug_complete(DeviceState *dev, Error **errp)
+{
+    /* isolate from mem view */
+    qdev_unmap(dev);
+    qemu_lock_devtree();
+    /* isolate from device tree */
+    qdev_unset_parent(dev);
+    qemu_unlock_devtree();
+    object_unref(OBJECT(dev));

Rather than deferring the free, you should defer the unref. Otherwise the following can happen when you have real RCU access to the memory map on the read-side:

    VCPU thread                      I/O thread
    =========================================================
    get MMIO request
    rcu_read_lock()
    walk memory map
                                     qdev_unmap()
                                     lock_devtree()
                                     ...
                                     unlock_devtree
                                     unref dev -> refcnt=0, free enqueued
    ref()
    rcu_read_unlock()
                                     free()
    dangling pointer!

unref should follow either synchronize_rcu(), or be in a call_rcu() callback (deferring the unref). IMO synchronize_rcu() is sufficient here, unplug is a truly slow path, esp. on real hardware.

-- 
error compiling committee.c: too many arguments to function
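The difference between deferring the free and deferring the unref can be shown with a toy refcount model. All names here are hypothetical, and the RCU grace period is simulated by an explicit callback rather than real call_rcu()/synchronize_rcu(); the point is only that the unplug path keeps holding its reference until all readers are guaranteed gone, so refcnt never hits zero while a reader might still call ref().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Object {
    int ref;
    bool freed;
} Object;

static void object_unref(Object *obj)
{
    assert(obj->ref > 0);
    if (--obj->ref == 0) {
        obj->freed = true;   /* stands in for object_finalize/free */
    }
}

/* Unplug with a deferred unref: the unplug path does not drop its
 * reference immediately; in QEMU this would be
 * call_rcu(..., object_unref, dev) or synchronize_rcu() + unref. */
static Object *pending_unref;

static void qdev_unplug_deferred(Object *dev)
{
    pending_unref = dev;
}

/* Runs once no RCU reader can still hold a pointer to the object. */
static void grace_period_elapsed(void)
{
    object_unref(pending_unref);
    pending_unref = NULL;
}
```

With the buggy order (deferring only the free), the refcount would already be zero at the point the reader calls ref(), which is exactly the dangling-pointer window in the timeline above.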
Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation
On 07.08.2012, at 17:59, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, August 07, 2012 3:38 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777 Subject: Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation On 03.08.2012, at 07:30, Bharat Bhushan wrote: This patch adds the watchdog emulation in KVM. The watchdog emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl. The kernel timer are used for watchdog emulation and emulates h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how it is configured. Signed-off-by: Liu Yu yu@freescale.com Signed-off-by: Scott Wood scottw...@freescale.com [bharat.bhus...@freescale.com: reworked patch] Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v7: - kvmppc_core_dequeue_watchdog() and kvmppc_core_enque_watchdog() are made staic - Clear the KVM_REQ_WATCHDOG request when TSR_ENW | ENW_WIS cleared rather than checking these bits on handling the request. 
Below changes (pasted for quick understanding) diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 7eedaeb..a0f5922 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -629,8 +629,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) goto out; } - if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) - (vcpu-arch.tsr (TSR_ENW | TSR_WIS))) { + if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) { kvm_run-exit_reason = KVM_EXIT_WATCHDOG; ret = 0; goto out; @@ -1098,8 +1097,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, } } - if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) - (vcpu-arch.tsr (TSR_ENW | TSR_WIS))) { + if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) { run-exit_reason = KVM_EXIT_WATCHDOG; r = RESUME_HOST | (r RESUME_FLAG_NV); } @@ -1251,8 +1249,10 @@ static int set_sregs_base(struct kvm_vcpu *vcpu, vcpu-arch.tsr = sregs-u.e.tsr; - if ((old_tsr ^ vcpu-arch.tsr) (TSR_ENW | TSR_WIS)) + if ((old_tsr ^ vcpu-arch.tsr) (TSR_ENW | TSR_WIS)) { + clear_bit(KVM_REQ_WATCHDOG, vcpu-requests); arm_next_watchdog(vcpu); + } update_timer_ints(vcpu); } @@ -1434,8 +1434,10 @@ void kvmppc_clr_tsr_bits(struct kvm_vcpu *vcpu, u32 tsr_bits) * We may have stopped the watchdog due to * being stuck on final expiration. */ - if (tsr_bits (TSR_ENW | TSR_WIS)) + if (tsr_bits (TSR_ENW | TSR_WIS)) { + clear_bit(KVM_REQ_WATCHDOG, vcpu-requests); arm_next_watchdog(vcpu); + } -- v6: - Added kvmppc_subarch_vcpu_unit/uninit() for subarch specific initialization - Using spin_lock_irqsave() - Using del_timer_sync(). v5: - Checking that TSR_ENW/TSR_WIS are still set on KVM_EXIT_WATCHDOG. - Moved watchdog_next_timeout() in lock. 
arch/powerpc/include/asm/kvm_host.h |3 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/asm/reg_booke.h |7 ++ arch/powerpc/kvm/book3s.c|9 ++ arch/powerpc/kvm/booke.c | 158 ++ arch/powerpc/kvm/booke_emulate.c |8 ++ arch/powerpc/kvm/powerpc.c | 14 +++- include/linux/kvm.h |2 + include/linux/kvm_host.h |1 + 9 files changed, 202 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 572ad01..873ec11 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -469,6 +469,8 @@ struct kvm_vcpu_arch { ulong fault_esr; ulong queued_dear; ulong queued_esr; + spinlock_t wdt_lock; + struct timer_list wdt_timer; u32 tlbcfg[4]; u32 mmucfg; u32 epr; @@ -484,6 +486,7 @@ struct kvm_vcpu_arch { u8 osi_needed; u8 osi_enabled; u8 papr_enabled; + u8 watchdog_enabled; u8 sane; u8 cpu_type; u8 hcall_needed; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 0124937..1438a5e 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -68,6 +68,8 @@ extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu); extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb); extern void kvmppc_decrementer_func(unsigned long data); extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu); +extern int kvmppc_subarch_vcpu_init(struct kvm_vcpu *vcpu); +extern void
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote: On 08/08/2012 03:49 AM, David Gibson wrote: We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 don't do it in what seems like the analogous situation. It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case. Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86). Uh.. overloaded in what way. On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and the two PICs are emulated in the kernel. Now the IOAPIC and the PICs correspond to non-x86 interrupt controllers, but the local APIC is more tightly coupled to the core. Interrupt acceptance by the core is an operation that involved synchronous communication with the local APIC: the APIC presents the interrupt, the core accepts it based on the value of the interrupt enable flag and possible a register (CR8), then the APIC updates the ISR and IRR. The upshot is that if the local APIC is in userspace, interrupts must be synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl and HLT is emulated in userspace (so that local APIC emulation can check if an interrupt wakes it up or not). Sorry, still not 100% getting it. When the vcpu is actually running code, that synchronous communication must still be accomplished via the KVM_INTERRUPT ioctl, yes? So what makes HLT different, that the communication can't be accomplished in that case. No, you're correct. HLT could have been emulated in userspace, it just wasn't. 
The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel. Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace in the first line there, anyway. Ok, so I am now reassured that our current handling of CEDE in kernelspace does not cause problems.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCH] KVM: correctly detect APIC SW state in kvm_apic_post_state_restore()
For apic_set_spiv() to track APIC SW state correctly it needs to see previous and next values of the spurious vector register, but currently memset() overwrite the old value before apic_set_spiv() get a chance to do tracking. Fix it by calling apic_set_spiv() before overwriting old value. Signed-off-by: Gleb Natapov g...@redhat.com diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index f019ca2..d72d79a 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1389,13 +1389,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu) return vector; } -void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu) +void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, + struct kvm_lapic_state *s) { struct kvm_lapic *apic = vcpu-arch.apic; kvm_lapic_set_base(vcpu, vcpu-arch.apic_base); + /* set SPIV separately to get count of SW disabled APICs right */ + apic_set_spiv(apic, *((u32 *)(s-regs + APIC_SPIV))); + memcpy(vcpu-arch.apic-regs, s-regs, sizeof *s); kvm_apic_set_version(vcpu); - apic_set_spiv(apic, kvm_apic_get_reg(apic, APIC_SPIV)); apic_update_ppr(apic); hrtimer_cancel(apic-lapic_timer.timer); diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 2ad9caa..615a8b0 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -54,7 +54,8 @@ int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type); u64 kvm_get_apic_base(struct kvm_vcpu *vcpu); void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data); -void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu); +void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, + struct kvm_lapic_state *s); int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu); u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 042754b..3c1ccea 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2345,8 +2345,7 @@ static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, struct 
kvm_lapic_state *s)
 {
-    memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
-    kvm_apic_post_state_restore(vcpu);
+    kvm_apic_post_state_restore(vcpu, s);
     update_cr8_intercept(vcpu);

     return 0;

-- 
Gleb.
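The ordering bug this patch fixes can be modelled in a few lines — assuming, as the commit message describes, that apic_set_spiv() keeps a count of software-enabled APICs by comparing the register's old value against the new one. If the whole register file is memcpy'd first, the "old" value it compares against is already the new one, so it never sees an edge and the count is never adjusted. The struct, constant, and function names below are simplified stand-ins for the KVM lapic code, not the real implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define APIC_SPIV_ENABLE 0x100u   /* illustrative software-enable bit */

static int sw_enabled_apics;      /* global count tracked by set_spiv */

typedef struct Apic { uint32_t spiv; } Apic;

/* Adjust the count only on an enable-bit transition, which requires
 * seeing the genuinely old register value. */
static void apic_set_spiv(Apic *apic, uint32_t val)
{
    if ((apic->spiv ^ val) & APIC_SPIV_ENABLE) {
        sw_enabled_apics += (val & APIC_SPIV_ENABLE) ? 1 : -1;
    }
    apic->spiv = val;
}

/* Correct restore order: set SPIV from the saved state *before*
 * overwriting the register file wholesale, as in the patch. */
static void post_state_restore(Apic *apic, const Apic *saved)
{
    apic_set_spiv(apic, saved->spiv);
    memcpy(apic, saved, sizeof(*apic));
}
```

With the old order (memcpy first, then apic_set_spiv on the already-copied value), the XOR in apic_set_spiv() is always zero and the count silently drifts.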
Re: [PATCH] KVM: x86 emulator: access GPRs on demand
On 08/08/2012 12:23 AM, Marcelo Tosatti wrote: @@ -281,8 +294,10 @@ struct x86_emulate_ctxt { bool rip_relative; unsigned long _eip; struct operand memop; +u32 regs_valid; /* bitmaps of registers in _regs[] that can be read */ +u32 regs_dirty; /* bitmaps of registers in _regs[] that have been written */ emul_regs_dirty (to avoid with the other regs_dirty). Well, it's in x86_emulate_ctxt, and there are other fields that can be confused here. I think it's okay as they're never used together. @@ -4453,7 +4447,6 @@ int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip) return EMULATE_FAIL; ctxt-eip = ctxt-_eip; -memcpy(vcpu-arch.regs, ctxt-regs, sizeof ctxt-regs); kvm_rip_write(vcpu, ctxt-eip); kvm_set_rflags(vcpu, ctxt-eflags); Need to writeback here? In emulate_in_real(). Good catch. @@ -4601,7 +4594,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, changes registers values during IO operation */ if (vcpu-arch.emulate_regs_need_sync_from_vcpu) { vcpu-arch.emulate_regs_need_sync_from_vcpu = false; -memcpy(ctxt-regs, vcpu-arch.regs, sizeof ctxt-regs); +ctxt-regs_valid = 0; } I think you can improve this hack now (perhaps invalidate the emulator cache on SET_REGS?). This is dangerous, if someone does a GET_REGS/SET_REGS pair within an mmio transaction. The documentation implies it should not be done, but it doesn't say so explicitly. It's likely qemu allows it if the mmio transaction drops the lock, for example. I agree that doing it in SET_REGS is better conceptually. 
 vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
 }
 regs->rax = kvm_register_read(vcpu, VCPU_REGS_RAX);

@@ -5724,6 +5720,7 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
 {
     struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
     int ret;
+    unsigned reg;

     init_emulate_ctxt(vcpu);

@@ -5733,7 +5730,8 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
     if (ret)
         return EMULATE_FAIL;

-    memcpy(vcpu->arch.regs, ctxt->regs, sizeof ctxt->regs);
+    for_each_set_bit(reg, (ulong *)&ctxt->regs_dirty, 16)
+        kvm_register_write(vcpu, reg, ctxt->_regs[reg]);

Should update regs_avail/regs_dirty? Better to do any of that in emulator.c via interfaces.

emulator_task_switch() should do it, yes.

     kvm_rip_write(vcpu, ctxt->eip);
     kvm_set_rflags(vcpu, ctxt->eflags);
     kvm_make_request(KVM_REQ_EVENT, vcpu);
-- 
1.7.11.2

Did not double check that the emulator.c conversion is complete with your patch (which should be done).

I did s/regs/_regs/ for that. Or did you mean something else?

Regarding emulator interfaces, perhaps we should make all entry points regular:

    emulator_init(ctxt, ops, vcpu);  // just once
    emulator_reset(ctxt)
    while ((ret = emulator_entry_point(ctxt)) == EMULATOR_NEED_DATA)
        get that data from somewhere

emulator_reset() simply resets the state machine, can be as simple as clearing a bool flag. emulator_entry_point (can be x86_emulate_insn(), x86_emulate_int(), x86_emulate_task_switch(), etc.) runs until one of the callbacks tells it to stop to get more data. (of course the while loop is actually in userspace)

-- 
error compiling committee.c: too many arguments to function
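The regs_valid/regs_dirty scheme being discussed can be sketched as a lazy register cache: a register is fetched from the vcpu only the first time the emulator touches it, and only registers the emulator actually modified are written back. This is an illustrative model, not the KVM code; the accessor names (reg_read, reg_rmw, writeback_registers) are invented.

```c
#include <assert.h>
#include <stdint.h>

enum { NR_EMU_REGS = 16 };

static uint64_t vcpu_regs[NR_EMU_REGS];      /* authoritative vcpu state */

typedef struct EmulCtxt {
    uint32_t regs_valid;                     /* bitmap: _regs[i] readable */
    uint32_t regs_dirty;                     /* bitmap: _regs[i] modified */
    uint64_t _regs[NR_EMU_REGS];
} EmulCtxt;

static void fetch_if_needed(EmulCtxt *ctxt, unsigned reg)
{
    if (!(ctxt->regs_valid & (1u << reg))) {
        ctxt->regs_valid |= 1u << reg;
        ctxt->_regs[reg] = vcpu_regs[reg];   /* fetch on first use */
    }
}

static uint64_t reg_read(EmulCtxt *ctxt, unsigned reg)
{
    fetch_if_needed(ctxt, reg);
    return ctxt->_regs[reg];
}

/* Read-modify-write accessor: the caller may write through the pointer,
 * so the register is marked dirty up front. */
static uint64_t *reg_rmw(EmulCtxt *ctxt, unsigned reg)
{
    fetch_if_needed(ctxt, reg);
    ctxt->regs_dirty |= 1u << reg;
    return &ctxt->_regs[reg];
}

/* Write back only what was modified, then clear the dirty bitmap —
 * this replaces the old "memcpy all 16 registers" on every exit. */
static void writeback_registers(EmulCtxt *ctxt)
{
    for (unsigned reg = 0; reg < NR_EMU_REGS; reg++) {
        if (ctxt->regs_dirty & (1u << reg)) {
            vcpu_regs[reg] = ctxt->_regs[reg];
        }
    }
    ctxt->regs_dirty = 0;
}
```

Invalidating the cache after userspace changes registers mid-transaction then reduces to clearing regs_valid, which is the `ctxt->regs_valid = 0` line in the patch.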
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On 08/08/2012 02:59 PM, David Gibson wrote: No, you're correct. HLT could have been emulated in userspace, it just wasn't. The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel. Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace in the first line there, anyway. I did. Ok, so I am now reassured that our current handling of CEDE in kernelspace does not cause problems. Great. It's a real pity the original local APIC implementation was in userspace, it causes no end of confusion, and in fact was broken for smp until recently even though it is 7 years old. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations
On Wed, Aug 8, 2012 at 10:21 AM, Peter Maydell peter.mayd...@linaro.org wrote: On 8 August 2012 07:25, Liu Ping Fan qemul...@gmail.com wrote:

+static inline void atomic_sub(int i, Atomic *v)
+{
+    asm volatile("lock; subl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}

NAK. We don't want random inline assembly implementations of locking primitives in QEMU, they are way too hard to keep working with all the possible host architectures we support. I spent some time a while back getting rid of the (variously busted) versions we had previously. If you absolutely must use atomic ops, use the gcc builtins. For preference, stick to higher level and less error-prone abstractions.

We're spoilt for choice here:

1. Atomic built-ins from gcc
2. glib atomics

No need to roll our own or copy the implementation from the kernel.

Stefan
Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations
On 08/08/2012 15:09, Stefan Hajnoczi wrote: NAK. We don't want random inline assembly implementations of locking primitives in QEMU, they are way too hard to keep working with all the possible host architectures we support. I spent some time a while back getting rid of the (variously busted) versions we had previously. If you absolutely must use atomic ops, use the gcc builtins. For preference, stick to higher level and less error-prone abstractions. We're spoilt for choice here: 1. Atomic built-ins from gcc 2. glib atomics No need to roll our own or copy the implementation from the kernel.

To some extent we need to because:

1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some versions of GCC mb is known to be (wrongly) a no-op.

2. glib atomics do not provide mb/rmb/wmb either, and g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers everywhere, while it is clearer if you put barriers manually, and you often do not need barriers in the get side. glib atomics also do not provide xchg.

I agree however that a small wrapper around GCC atomics is much better than assembly. Assembly can be limited to the memory barriers (where we already have it anyway).

Paolo
Re: [Qemu-devel] [PATCH 01/15] atomic: introduce atomic operations
On 8 August 2012 14:18, Paolo Bonzini pbonz...@redhat.com wrote: On 08/08/2012 15:09, Stefan Hajnoczi wrote: No need to roll our own or copy the implementation from the kernel. To some extent we need to because: 1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some versions of GCC mb is known to be (wrongly) a no-op. 2. glib atomics do not provide mb/rmb/wmb either, and g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers everywhere, while it is clearer if you put barriers manually, and you often do not need barriers in the get side. glib atomics also do not provide xchg.

These are arguments in favour of "don't try to use atomic ops" -- if serious large projects like GCC and glib can't produce working efficient implementations for all target architectures, what chance do we have?

-- PMM
Re: [PATCH 01/15] atomic: introduce atomic operations
On 08/08/2012 15:32, Peter Maydell wrote: 1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some versions of GCC mb is known to be (wrongly) a no-op. 2. glib atomics do not provide mb/rmb/wmb either, and g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers everywhere, while it is clearer if you put barriers manually, and you often do not need barriers in the get side. glib atomics also do not provide xchg. These are arguments in favour of don't try to use atomic ops -- if serious large projects like GCC and glib can't produce working efficient implementations for all target architectures, what chance do we have?

Well, maybe... but the flaws in both GCC and glib are small in size (even though large in importance, at least for us) and we can work around them easily. mb/rmb/wmb is essentially the small set of atomic operations that we're already using.

Paolo
Re: [PATCH 01/15] atomic: introduce atomic operations
On 08/08/2012 04:49 PM, Paolo Bonzini wrote: Il 08/08/2012 15:32, Peter Maydell ha scritto: 1. GCC atomics look ugly, :) do not provide rmb/wmb, and in some versions of GCC mb is known to be (wrongly) a no-op. 2. glib atomics do not provide mb/rmb/wmb either, and g_atomic_int_get/g_atomic_int_set are inefficient: they add barriers everywhere, while it is clearer if you put barriers manually, and you often do not need barriers in the get side. glib atomics also do not provide xchg. These are arguments in favour of don't try to use atomic ops -- if serious large projects like GCC and glib can't produce working efficient implementations for all target architectures, what chance do we have? Well, maybe... but the flaws in both GCC and glib are small in size (even though large in importance, at least for us) and we can work around them easily. mb/rmb/wmb is essentially the small set of atomic operations that we're already using.

We can easily define rmb()/wmb() to be __sync_synchronize() as a default, and only override them where it matters.

-- 
error compiling committee.c: too many arguments to function
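A minimal wrapper layer along the lines Avi and Paolo suggest would avoid hand-written assembly entirely: the read-modify-write operations come from the GCC __sync builtins, and the barriers default to __sync_synchronize(), to be overridden per-host only where it matters. This is a hedged sketch; the names are illustrative, not QEMU's actual API, and note that per the GCC documentation __sync_lock_test_and_set is only an acquire barrier, not a full one.

```c
#include <assert.h>

/* Barriers: full __sync_synchronize() as the safe default; a host that
 * cares (e.g. x86, where loads are already ordered) can override rmb/wmb
 * with something cheaper. */
#define smp_mb()   __sync_synchronize()
#define smp_rmb()  __sync_synchronize()
#define smp_wmb()  __sync_synchronize()

typedef struct Atomic { volatile int counter; } Atomic;

static inline int atomic_read(const Atomic *v)  { return v->counter; }
static inline void atomic_set(Atomic *v, int i) { v->counter = i; }

static inline void atomic_inc(Atomic *v)
{
    __sync_fetch_and_add(&v->counter, 1);
}

static inline void atomic_sub(int i, Atomic *v)
{
    __sync_fetch_and_sub(&v->counter, i);
}

/* xchg, which glib's atomics lack: acquire semantics only. */
static inline int atomic_xchg(Atomic *v, int i)
{
    return __sync_lock_test_and_set(&v->counter, i);
}
```

Compared with the NAK'd patch, atomic_sub() here compiles to the same lock-prefixed instruction on x86 but also works unchanged on every other host GCC supports.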
Re: [Qemu-devel] KVM call agenda for tuesday 31
On 31.01.2012 15:01, Mitsyanko Igor wrote: On 01/31/2012 05:15 PM, Andreas Färber wrote: On 31.01.2012 00:53, Anthony Liguori wrote: On 01/30/2012 05:41 PM, Andreas Färber wrote: On 30.01.2012 19:55, Juan Quintela wrote: Please send in any agenda items you are interested in covering. VMState: Anthony specifically said that VMState was not affected by QOM and that patches should not be deferred until the merge. Yet there's no review and/or decision-making for a month now. Ping^2 for AHCI+SDHC. Do you have pointers (to pending VMState patches)? http://patchwork.ozlabs.org/patch/137732/ (PATCH v4) It's basically about how to deal with variable-sized arrays. (Alex mentioned it on one call around November.) I found ways to deal with subsets of arrays embedded within the struct and variable-sized lists of pointers to structs, but no solution for a malloc()'ed array of structs. Maybe I'm just too stupid to see. Anyway, no one commented since Xmas. Igor posted (and refined for v2) a patch with a callback-based approach that I find promising. From my view, unofficially Juan is the VMState guy, he's been cc'ed. Are we lacking an official maintainer that cares? Or is Juan the official, undocumented maintainer but simply busy? SUSE's interest is making AHCI migratable, and my VMState workaround for that is simply ugly: http://patchwork.ozlabs.org/patch/133066/ (RFC) If I'm not mistaken, if you change AHCIState's .ports type to uint32_t you can use the existing VMSTATE_BUFFER_MULTIPLY macro like this: VMSTATE_BUFFER_MULTIPLY(dev, AHCIState, 0, NULL, 0, ports, sizeof(AHCIDevice)) Igor, I finally got around to rebasing and trying this: Am I seeing correctly that this tries to serialize the whole of AHCIDevice as an opaque buffer? The difficulty here was that we were looking for a way to serialize a variable number of structured elements with their own vmstate_ahci_device specifying what fields to serialize. Juan, how should we proceed there? Do as Igor suggested?
Some VMSTATE_ macro I'm overlooking? Or do we need some new macro for this use case? Regards, Andreas VMSTATE_BUFFER_MULTIPLY currently lacks the VMS_POINTER flag and therefore doesn't make any use of the _start field (you don't need it anyway). Nevertheless, VMSTATE_BUFFER_MULTIPLY is just a partial solution to a bigger set of possible problems. SD card's vmstate implementation requires a shift operation, SDHC gets its size from a switch {} statement, something else later may require division or addition etc.; a get_bufsize callback will cover all possible cases. -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
hang on reboot with 3.6-rc1
Not sure if KVM is the culprit, but it is the last line shown on the console. I have to power cycle the server to reboot. David attachment: kvm-3.6-rc1.png
Re: [kvmarm] [PATCH 5/8] KVM: Add hva_to_memslot
On 08.08.2012, at 06:55, Christoffer Dall wrote: On Tue, Aug 7, 2012 at 6:57 AM, Alexander Graf ag...@suse.de wrote: Architecture code might want to figure out if an hva that it gets for example via the mmu notifier callbacks actually is in guest mapped memory, and if so, in which memory slot. This patch introduces a helper function to enable it to do so. It is a prerequisite for the e500 mmu notifier implementation. Signed-off-by: Alexander Graf ag...@suse.de --- include/linux/kvm_host.h |1 + virt/kvm/kvm_main.c | 14 ++ 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index dbc65f9..2b92928 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -464,6 +464,7 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len); int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva); int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn); unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn); void mark_page_dirty(struct kvm *kvm, gfn_t gfn); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bcf973e..d42591d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -999,6 +999,20 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn) } EXPORT_SYMBOL_GPL(gfn_to_memslot); +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva) +{ + struct kvm_memslots *slots = kvm_memslots(kvm); + struct kvm_memory_slot *memslot; + + kvm_for_each_memslot(memslot, slots) + if (hva >= memslot->userspace_addr && + hva < memslot->userspace_addr + memslot->npages) addr + npages, this doesn't look right Thanks a lot for spotting that one!
+ return memslot; + + return NULL; +} +EXPORT_SYMBOL_GPL(hva_to_memslot); consider also adding a hva_to_gpa wrapper now when you're at it, then ARM code will be so happy: bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa) { struct kvm_memory_slot *memslot; memslot = hva_to_memslot(kvm, hva); if (!memslot) return false; gpa_t gpa_offset = hva - memslot->userspace_addr; *gpa = (memslot->base_gfn << PAGE_SHIFT) + gpa_offset; return true; } What do you need this for? I usually don't like to add framework code that has no users (yet). Alex
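The bug Christoffer spotted is that `npages` counts pages, not bytes, so it must be scaled before being added to `userspace_addr`. A self-contained sketch of the corrected range check (toy types, not the kernel API; the page-shift scaling is my reading of the fix):

```c
/* Toy model of hva_to_memslot()'s range check: npages counts pages,
 * so scale it to bytes before comparing against the hva. */
#define TOY_PAGE_SHIFT 12

struct toy_memslot {
    unsigned long userspace_addr;   /* start of the slot's user mapping */
    unsigned long npages;           /* slot size in pages */
};

static struct toy_memslot *toy_hva_to_memslot(struct toy_memslot *slots,
                                              int nslots, unsigned long hva)
{
    int i;

    for (i = 0; i < nslots; i++) {
        struct toy_memslot *m = &slots[i];

        if (hva >= m->userspace_addr &&
            hva < m->userspace_addr + (m->npages << TOY_PAGE_SHIFT))
            return m;               /* hva falls inside this slot */
    }
    return 0;                       /* not guest mapped memory */
}
```

With the unscaled comparison from the original patch, a slot of a few pages would only match hvas within a handful of bytes of its start.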
Re: [PATCH 0/8] KVM: PPC: E500: Implement MMU Notifiers
On 07.08.2012, at 12:57, Alexander Graf wrote: This patch set adds a very simple implementation of MMU notifiers for the e500 target. Along the way, I stumbled over a few other things that I've put into the same patch set, namely: * e500 prerequisites * icache flushing on page map (probably also hits ARM!) * exit trace point for e500 Alexander Graf (8): KVM: PPC: BookE: Expose remote TLB flushes in debugfs KVM: PPC: E500: Fix clear_tlb_refs KVM: PPC: PR: Use generic tracepoint for guest exit KVM: PPC: Expose SYNC cap based on mmu notifiers I applied the above 4 patches to kvm-ppc-next. Alex
[PATCH] handle device help before accelerator set up
A command line device probe using just -device ? gets processed after qemu-kvm initializes the accelerator. If /dev/kvm is not present, the accelerator check will fail (kvm is defaulted to on), which causes libvirt to not be set up to handle qemu guests. Moving the device help handling before the accelerator setup allows the device probe to work in this configuration, and libvirt succeeds in setting up for a qemu hypervisor mode. Signed-off-by: Bruce Rogers brog...@suse.com --- vl.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/vl.c b/vl.c index 1a46d2d..5b75cf9 100644 --- a/vl.c +++ b/vl.c @@ -3380,6 +3380,9 @@ int main(int argc, char **argv, char **envp) ram_size = DEFAULT_RAM_SIZE * 1024 * 1024; } +if (qemu_opts_foreach(qemu_find_opts("device"), device_help_func, NULL, 0) != 0) +exit(0); + configure_accelerator(); qemu_init_cpu_loop(); @@ -3535,9 +3538,6 @@ int main(int argc, char **argv, char **envp) } select_vgahw(vga_model); -if (qemu_opts_foreach(qemu_find_opts("device"), device_help_func, NULL, 0) != 0) -exit(0); - if (watchdog) { i = select_watchdog(watchdog); if (i > 0) -- 1.7.7
Re: [Qemu-devel] [PATCH v8 5/6] introduce a new qom device to deal with panicked event
On Wed, Aug 8, 2012 at 2:47 AM, Wen Congyang we...@cn.fujitsu.com wrote: If the target is x86/x86_64, the guest's kernel will write 0x01 to the port KVM_PV_EVENT_PORT when it is panicked. This patch introduces a new qom device kvm_pv_ioport to listen on this I/O port, and deal with the panicked event according to panicked_action's value. The possible actions are: 1. emit QEVENT_GUEST_PANICKED only 2. emit QEVENT_GUEST_PANICKED and pause the guest 3. emit QEVENT_GUEST_PANICKED and poweroff the guest 4. emit QEVENT_GUEST_PANICKED and reset the guest I/O ports do not work for some targets (for example: s390). You can implement another qom device, and include its code into pv_event.c for such a target. Note: if we emit QEVENT_GUEST_PANICKED only, and the management application does not receive this event (the management may not run when the event is emitted), the management won't know the guest is panicked. Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- hw/kvm/Makefile.objs |2 +- hw/kvm/pv_event.c| 109 ++ hw/kvm/pv_ioport.c | 93 ++ hw/pc_piix.c |9 kvm.h|2 + 5 files changed, 214 insertions(+), 1 deletions(-) create mode 100644 hw/kvm/pv_event.c create mode 100644 hw/kvm/pv_ioport.c diff --git a/hw/kvm/Makefile.objs b/hw/kvm/Makefile.objs index 226497a..23e3b30 100644 --- a/hw/kvm/Makefile.objs +++ b/hw/kvm/Makefile.objs @@ -1 +1 @@ -obj-$(CONFIG_KVM) += clock.o apic.o i8259.o ioapic.o i8254.o +obj-$(CONFIG_KVM) += clock.o apic.o i8259.o ioapic.o i8254.o pv_event.o diff --git a/hw/kvm/pv_event.c b/hw/kvm/pv_event.c new file mode 100644 index 000..8897237 --- /dev/null +++ b/hw/kvm/pv_event.c @@ -0,0 +1,109 @@ +/* + * QEMU KVM support, paravirtual event device + * + * Copyright Fujitsu, Corp. 2012 + * + * Authors: + * Wen Congyang we...@cn.fujitsu.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory.
+ * + */ + +#include <linux/kvm_para.h> +#include <asm/kvm_para.h> +#include "qobject.h" +#include "qjson.h" +#include "monitor.h" +#include "sysemu.h" +#include "kvm.h" + +/* Possible values for action parameter. */ +#define PANICKED_REPORT 1 /* emit QEVENT_GUEST_PANICKED only */ +#define PANICKED_PAUSE 2 /* emit QEVENT_GUEST_PANICKED and pause VM */ +#define PANICKED_POWEROFF 3 /* emit QEVENT_GUEST_PANICKED and quit VM */ +#define PANICKED_RESET 4 /* emit QEVENT_GUEST_PANICKED and reset VM */ + +#define PV_EVENT_DRIVER "kvm_pv_event" + +struct pv_event_action { PVEventAction +char *panicked_action; +int panicked_action_value; +}; + +#define DEFINE_PV_EVENT_PROPERTIES(_state, _conf) \ +DEFINE_PROP_STRING("panicked_action", _state, _conf.panicked_action) + +static void panicked_mon_event(const char *action) +{ +QObject *data; + +data = qobject_from_jsonf("{ 'action': %s }", action); +monitor_protocol_event(QEVENT_GUEST_PANICKED, data); +qobject_decref(data); +} + +static void panicked_perform_action(uint32_t panicked_action) +{ +switch (panicked_action) { +case PANICKED_REPORT: +panicked_mon_event("report"); +break; + +case PANICKED_PAUSE: +panicked_mon_event("pause"); +vm_stop(RUN_STATE_GUEST_PANICKED); +break; + +case PANICKED_POWEROFF: +panicked_mon_event("poweroff"); +qemu_system_shutdown_request(); +break; Misses a line break unlike other cases.
+case PANICKED_RESET: +panicked_mon_event("reset"); +qemu_system_reset_request(); +break; +} +} + +static uint64_t supported_event(void) +{ +return 1 << KVM_PV_FEATURE_PANICKED; +} + +static void handle_event(int event, struct pv_event_action *conf) +{ +if (event == KVM_PV_EVENT_PANICKED) { +panicked_perform_action(conf->panicked_action_value); +} +} + +static int pv_event_init(struct pv_event_action *conf) +{ +if (!conf->panicked_action) { +conf->panicked_action_value = PANICKED_REPORT; +} else if (strcasecmp(conf->panicked_action, "none") == 0) { +conf->panicked_action_value = PANICKED_REPORT; +} else if (strcasecmp(conf->panicked_action, "pause") == 0) { +conf->panicked_action_value = PANICKED_PAUSE; +} else if (strcasecmp(conf->panicked_action, "poweroff") == 0) { +conf->panicked_action_value = PANICKED_POWEROFF; +} else if (strcasecmp(conf->panicked_action, "reset") == 0) { +conf->panicked_action_value = PANICKED_RESET; +} else { +return -1; +} + +return 0; +} + +#if defined(KVM_PV_EVENT_PORT) + +#include "pv_ioport.c" I'd rather not include any .c files but insert the contents here directly. + +#else +void kvm_pv_event_init(void *opaque) +{ +} +#endif diff
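The string-to-action mapping in `pv_event_init()` can be condensed into a single lookup; this is a standalone sketch with hypothetical names, mirroring the patch's behavior (missing or "none" defaults to report-only, unknown strings are rejected):

```c
#include <strings.h>   /* strcasecmp */

/* Toy version of the panicked_action parsing in pv_event_init(). */
enum {
    ACT_REPORT = 1,    /* emit the event only */
    ACT_PAUSE,         /* emit and pause */
    ACT_POWEROFF,      /* emit and shut down */
    ACT_RESET          /* emit and reset */
};

static int parse_panicked_action(const char *s)
{
    if (!s || strcasecmp(s, "none") == 0)
        return ACT_REPORT;          /* default: report only */
    if (strcasecmp(s, "pause") == 0)
        return ACT_PAUSE;
    if (strcasecmp(s, "poweroff") == 0)
        return ACT_POWEROFF;
    if (strcasecmp(s, "reset") == 0)
        return ACT_RESET;
    return -1;                      /* invalid configuration value */
}
```

Using `strcasecmp` keeps the property value case-insensitive, matching what the patch does.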
Re: [Qemu-devel] [PATCH 3/5] s390: Add new channel I/O based virtio transport.
On Wed, Aug 8, 2012 at 8:28 AM, Cornelia Huck cornelia.h...@de.ibm.com wrote: On Tue, 7 Aug 2012 20:47:22 + Blue Swirl blauwir...@gmail.com wrote: diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c new file mode 100644 index 000..8a90c3a --- /dev/null +++ b/hw/s390x/virtio-ccw.c @@ -0,0 +1,962 @@ +/* + * virtio ccw target implementation + * + * Copyright 2012 IBM Corp. + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or (at + * your option) any later version. See the COPYING file in the top-level + * directory. + */ + +#include "hw/hw.h" +#include "block.h" +#include "blockdev.h" +#include "sysemu.h" +#include "net.h" +#include "monitor.h" +#include "qemu-thread.h" +#include "../virtio.h" +#include "../virtio-serial.h" +#include "../virtio-net.h" +#include "../sysbus.h" hw/virtio... for the above OK. +#include "bitops.h" + +#include "ioinst.h" +#include "css.h" +#include "virtio-ccw.h" + +static const TypeInfo virtio_ccw_bus_info = { +.name = TYPE_VIRTIO_CCW_BUS, +.parent = TYPE_BUS, +.instance_size = sizeof(VirtioCcwBus), +}; + +static const VirtIOBindings virtio_ccw_bindings; + +typedef struct sch_entry { +SubchDev *sch; +QLIST_ENTRY(sch_entry) entry; +} sch_entry; SubchEntry, see CODING_STYLE. Also other struct and typedef names below. + +QLIST_HEAD(subch_list, sch_entry); static, but please put this to a structure that is passed around instead. + +typedef struct devno_entry { +uint16_t devno; +QLIST_ENTRY(devno_entry) entry; +} devno_entry; + +QLIST_HEAD(devno_list, devno_entry); Ditto + +struct subch_set { +struct subch_list *s_list[256]; +struct devno_list *d_list[256]; +}; + +struct css_set { +struct subch_set *set[MAX_SSID + 1]; +}; + +static struct css_set *channel_subsys[MAX_CSSID + 1]; OK, will try to come up with some kind of structure for this and CamelCasify it.
+ +VirtIODevice *virtio_ccw_get_vdev(SubchDev *sch) +{ +VirtIODevice *vdev = NULL; + +if (sch->driver_data) { +vdev = ((VirtioCcwData *)sch->driver_data)->vdev; +} +return vdev; +} + +VirtioCcwBus *virtio_ccw_bus_init(void) +{ +VirtioCcwBus *bus; +BusState *_bus; Please avoid identifiers with leading underscores. OK. +DeviceState *dev; + +css_set_subch_cb(virtio_ccw_find_subch); + +/* Create bridge device */ +dev = qdev_create(NULL, "virtio-ccw-bridge"); +qdev_init_nofail(dev); + +/* Create bus on bridge device */ +_bus = qbus_create(TYPE_VIRTIO_CCW_BUS, dev, "virtio-ccw"); +bus = DO_UPCAST(VirtioCcwBus, bus, _bus); + +/* Enable hotplugging */ +_bus->allow_hotplug = 1; + +return bus; +} + +struct vq_info_block { +uint64_t queue; +uint16_t num; +} QEMU_PACKED; + +struct vq_config_block { +uint16_t index; +uint16_t num; +} QEMU_PACKED; Aren't these KVM structures? They should be defined in a KVM header file file in linux-headers. Not really, virtio-ccw isn't tied to kvm. I see this more as command blocks that are specific to the control unit - like something that would be defined in an attachment specification for a classic s390 device (and in the virtio spec in this case) and modeled as C structures here. OK. Then please use CamelCase for these too. +case CCW_CMD_WRITE_CONF: +if (check_len) { +if (ccw->count > data->vdev->config_len) { +ret = -EINVAL; +break; +} +} +len = MIN(ccw->count, data->vdev->config_len); +config = qemu_get_ram_ptr(ccw->cda); Please use cpu_physical_memory_read() (or DMA versions) instead of this + memcpy(). Will check. +if (!config) { +ret = -EFAULT; +} else { +memcpy(data->vdev->config, config, len); +if (data->vdev->set_config) { +data->vdev->set_config(data->vdev, data->vdev->config); +} +sch->curr_status.scsw.count = ccw->count - len; +ret = 0; +} +break; +case CCW_CMD_READ_VQ_CONF: +if (check_len) { +if (ccw->count != sizeof(vq_config)) { +ret = -EINVAL; +break; +} +} else if (ccw->count < sizeof(vq_config)) { +/* Can't execute command.
*/ +ret = -EINVAL; +break; +} +if (!qemu_get_ram_ptr(ccw->cda)) { +ret = -EFAULT; +} else { +vq_config.index = lduw_phys(ccw->cda); lduw_{b,l}e_phys() +vq_config.num = virtio_queue_get_num(data->vdev, vq_config.index); +stw_phys(ccw->cda + sizeof(vq_config.index), vq_config.num);
Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.
On Wed, Aug 8, 2012 at 8:17 AM, Cornelia Huck cornelia.h...@de.ibm.com wrote: On Tue, 7 Aug 2012 21:00:59 + Blue Swirl blauwir...@gmail.com wrote: diff --git a/hw/s390x/css.c b/hw/s390x/css.c new file mode 100644 index 000..7941c44 --- /dev/null +++ b/hw/s390x/css.c @@ -0,0 +1,440 @@ +/* + * Channel subsystem base support. + * + * Copyright 2012 IBM Corp. + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or (at + * your option) any later version. See the COPYING file in the top-level + * directory. + */ + +#include "qemu-thread.h" +#include "qemu-queue.h" +#include "hw/qdev.h" +#include "kvm.h" +#include "cpu.h" +#include "ioinst.h" +#include "css.h" + +struct chp_info { CamelCase, please. OK. +uint8_t in_use; +uint8_t type; +}; + +static struct chp_info chpids[MAX_CSSID + 1][MAX_CHPID + 1]; + +static css_subch_cb_func css_subch_cb; Probably these can be put to a container structure which can be passed around. Still trying to come up with a good model for that. +case CCW_CMD_SENSE_ID: +{ +uint8_t sense_bytes[256]; + +/* Sense ID information is device specific. */ +memcpy(sense_bytes, &sch->id, sizeof(sense_bytes)); +if (check_len) { +if (ccw->count != sizeof(sense_bytes)) { +ret = -EINVAL; +break; +} +} +len = MIN(ccw->count, sizeof(sense_bytes)); +/* + * Only indicate 0xff in the first sense byte if we actually + * have enough place to store at least bytes 0-3. + */ +if (len >= 4) { +stb_phys(ccw->cda, 0xff); +} else { +stb_phys(ccw->cda, 0); +} +i = 1; +for (i = 1; i < len - 1; i++) { +stb_phys(ccw->cda + i, sense_bytes[i]); +} cpu_physical_memory_write() Hm, what's wrong with storing byte-by-byte? cpu_physical_memory_write() could be more optimal, for example resolve guest addresses only once per page.
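The SENSE_ID copy logic above can be modeled in isolation; this toy version writes to a plain buffer instead of guest physical memory and mirrors the patch's loop bounds exactly (note the loop stops at `len - 1`, so the last byte is left untouched — faithfully reproduced here, not corrected):

```c
/* Toy model of the SENSE_ID copy: byte 0 is forced to 0xff only when
 * at least bytes 0-3 fit, the rest is taken from the sense data. */
static void toy_sense_copy(unsigned char *dst, const unsigned char *sense,
                           int len)
{
    int i;

    dst[0] = (len >= 4) ? 0xff : 0;
    for (i = 1; i < len - 1; i++)
        dst[i] = sense[i];          /* per-byte store, as in the patch */
}
```

Blue Swirl's point is orthogonal to correctness: a bulk `cpu_physical_memory_write()` of the same range would resolve the guest address once per page instead of once per byte.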
+sch->curr_status.scsw.count = ccw->count - len; +ret = 0; +break; +} +case CCW_CMD_TIC: +if (sch->last_cmd->cmd_code == CCW_CMD_TIC) { +ret = -EINVAL; +break; +} +if (ccw->flags & (CCW_FLAG_CC | CCW_FLAG_DC)) { +ret = -EINVAL; +break; +} +sch->channel_prog = qemu_get_ram_ptr(ccw->cda); +ret = sch->channel_prog ? -EAGAIN : -EFAULT; +break; +default: +if (sch->ccw_cb) { +/* Handle device specific commands. */ +ret = sch->ccw_cb(sch, ccw); +} else { +ret = -EOPNOTSUPP; +} +break; +} +sch->last_cmd = ccw; +if (ret == 0) { +if (ccw->flags & CCW_FLAG_CC) { +sch->channel_prog += 8; +ret = -EAGAIN; +} +} + +return ret; diff --git a/hw/s390x/css.h b/hw/s390x/css.h new file mode 100644 index 000..b8a95cc --- /dev/null +++ b/hw/s390x/css.h @@ -0,0 +1,62 @@ +/* + * Channel subsystem structures and definitions. + * + * Copyright 2012 IBM Corp. + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or (at + * your option) any later version. See the COPYING file in the top-level + * directory. + */ + +#ifndef CSS_H +#define CSS_H + +#include "ioinst.h" + +/* Channel subsystem constants. */ +#define MAX_SCHID 65535 +#define MAX_SSID 3 +#define MAX_CSSID 254 /* 255 is reserved */ +#define MAX_CHPID 255 + +#define MAX_CIWS 8 + +struct senseid { SenseID OK. +/* common part */ +uint32_t reserved:8;/* always 0x'FF' */ The standard syntax calls for 'unsigned' instead of uint32_t for bit fields. But bit fields are not very well defined, it's better to avoid them. Well, the equivalent Linux structure also looks like that :) But I can switch this to a uint8_t/uint16_t structure.
+uint32_t cu_type:16;/* control unit type */ +uint32_t cu_model:8;/* control unit model */ +uint32_t dev_type:16; /* device type */ +uint32_t dev_model:8; /* device model */ +uint32_t unused:8; /* padding byte */ +/* extended part */ +uint32_t ciw[MAX_CIWS]; /* variable # of CIWs */ +}; + diff --git a/target-s390x/ioinst.h b/target-s390x/ioinst.h new file mode 100644 index 000..79628b4 --- /dev/null +++ b/target-s390x/ioinst.h @@ -0,0 +1,173 @@ +/* + * S/390 channel I/O instructions + * + * Copyright 2012 IBM Corp. + * Author(s): Cornelia Huck cornelia.h...@de.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or
Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating
On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Use mem_map_lock to protect among updaters, so we can get an intact snapshot of the memory topology -- the FlatView radix-tree. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |3 +++ memory.c | 22 ++ memory.h |2 ++ 3 files changed, 27 insertions(+), 0 deletions(-) diff --git a/exec.c b/exec.c index 8244d54..0e29ef9 100644 --- a/exec.c +++ b/exec.c @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; The bottom level has pointers to MemoryRegionSections. */ static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; +QemuMutex mem_map_lock; + static void io_mem_init(void); static void memory_map_init(void); @@ -637,6 +639,7 @@ void cpu_exec_init_all(void) #if !defined(CONFIG_USER_ONLY) memory_map_init(); io_mem_init(); +qemu_mutex_init(&mem_map_lock); I'd move this and the mutex to memory.c since there are no other uses. The mutex could be static then.
#endif } diff --git a/memory.c b/memory.c index aab4a31..5986532 100644 --- a/memory.c +++ b/memory.c @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void) assert(memory_region_transaction_depth); --memory_region_transaction_depth; if (!memory_region_transaction_depth && memory_region_update_pending) { +qemu_mutex_lock(&mem_map_lock); memory_region_update_topology(NULL); +qemu_mutex_unlock(&mem_map_lock); } } @@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client) { uint8_t mask = 1 << client; +qemu_mutex_lock(&mem_map_lock); mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask); memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr, @@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr) void memory_region_set_readonly(MemoryRegion *mr, bool readonly) { if (mr->readonly != readonly) { +qemu_mutex_lock(&mem_map_lock); mr->readonly = readonly; memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } } @@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable) { if (mr->readable != readable) { mr->readable = readable; +qemu_mutex_lock(&mem_map_lock); memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } } @@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, }; unsigned i; +qemu_mutex_lock(&mem_map_lock); for (i = 0; i < mr->ioeventfd_nb; ++i) { if (memory_region_ioeventfd_before(mrfd, mr->ioeventfds[i])) { break; @@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, sizeof(*mr->ioeventfds) * (mr->ioeventfd_nb-1 - i)); mr->ioeventfds[i] = mrfd; memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } void memory_region_del_eventfd(MemoryRegion *mr, @@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, }; unsigned i; +qemu_mutex_lock(&mem_map_lock); for (i = 0; i < mr->ioeventfd_nb; ++i) { if
(memory_region_ioeventfd_equal(mrfd, mr->ioeventfds[i])) { break; @@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, mr->ioeventfds = g_realloc(mr->ioeventfds, sizeof(*mr->ioeventfds)*mr->ioeventfd_nb + 1); memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } static void memory_region_add_subregion_common(MemoryRegion *mr, @@ -1259,6 +1271,8 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, assert(!subregion->parent); subregion->parent = mr; subregion->addr = offset; + +qemu_mutex_lock(&mem_map_lock); QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { if (subregion->may_overlap || other->may_overlap) { continue; @@ -1289,6 +1303,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, QTAILQ_INSERT_TAIL(&mr->subregions, subregion, subregions_link); done: memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } @@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr, { assert(subregion->parent == mr); subregion->parent = NULL; + +qemu_mutex_lock(&mem_map_lock); QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); memory_region_update_topology(mr); +qemu_mutex_unlock(&mem_map_lock); } void memory_region_set_enabled(MemoryRegion *mr, bool enabled) @@ -1325,8 +1343,10 @@ void memory_region_set_enabled(MemoryRegion *mr,
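The pattern the patch applies everywhere is the same: take one shared mutex around every mutation of the memory map so concurrent updaters serialize. A minimal standalone sketch of that pattern (illustrative names, not the QEMU API):

```c
#include <pthread.h>

/* Every updater takes the same lock around the shared-map update, so
 * the committed topology is always an intact snapshot: no reader of
 * the committed state can observe a half-applied update. */
static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static int topology_generation;   /* stands in for the FlatView/radix tree */

static void update_topology_locked(void)
{
    pthread_mutex_lock(&map_lock);
    topology_generation++;        /* stands in for memory_region_update_topology() */
    pthread_mutex_unlock(&map_lock);
}
```

The cost of this approach is that the lock must wrap every call site, which is why the later patches in the series move toward swapping a single map pointer instead.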
Re: [PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy
On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com PhysMap contains the flatview and the radix-tree view; they are snapshots of the system topology and should be consistent. With PhysMap, we can swap the pointer when updating and achieve atomicity. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |8 memory.c | 33 - memory.h | 62 -- 3 files changed, 60 insertions(+), 43 deletions(-) diff --git a/exec.c b/exec.c index 0e29ef9..01b91b0 100644 --- a/exec.c +++ b/exec.c @@ -156,8 +156,6 @@ typedef struct PageDesc { #endif /* Size of the L2 (and L3, etc) page tables. */ Please copy this comment to the header file. -#define L2_BITS 10 -#define L2_SIZE (1 << L2_BITS) #define P_L2_LEVELS \ (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1) @@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask; static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -typedef struct PhysPageEntry PhysPageEntry; static MemoryRegionSection *phys_sections; static unsigned phys_sections_nb, phys_sections_nb_alloc; @@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; -struct PhysPageEntry { -uint16_t is_leaf : 1; - /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */ -uint16_t ptr : 15; -}; /* Simple allocator for PhysPageEntry nodes */ static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; diff --git a/memory.c b/memory.c index 2eaa2fc..c7f2cfd 100644 --- a/memory.c +++ b/memory.c @@ -31,17 +31,6 @@ static bool global_dirty_log = false; static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners = QTAILQ_HEAD_INITIALIZER(memory_listeners); -typedef struct AddrRange AddrRange; - -/* - * Note using signed integers limits us to physical addresses at most - * 63 bits wide. They are needed for negative offsetting in aliases - * (large MemoryRegion::alias_offset).
- */ -struct AddrRange { -Int128 start; -Int128 size; -}; static AddrRange addrrange_make(Int128 start, Int128 size) { @@ -197,28 +186,6 @@ static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd a, !memory_region_ioeventfd_before(b, a); } -typedef struct FlatRange FlatRange; -typedef struct FlatView FlatView; - -/* Range of memory in the global map. Addresses are absolute. */ -struct FlatRange { -MemoryRegion *mr; -target_phys_addr_t offset_in_region; -AddrRange addr; -uint8_t dirty_log_mask; -bool readable; -bool readonly; -}; - -/* Flattened global view of current active memory hierarchy. Kept in sorted - * order. - */ -struct FlatView { -FlatRange *ranges; -unsigned nr; -unsigned nr_allocated; -}; - typedef struct AddressSpace AddressSpace; typedef struct AddressSpaceOps AddressSpaceOps; diff --git a/memory.h b/memory.h index 740f018..357edd8 100644 --- a/memory.h +++ b/memory.h @@ -29,12 +29,72 @@ #include qemu-thread.h #include qemu/reclaimer.h +typedef struct AddrRange AddrRange; +typedef struct FlatRange FlatRange; +typedef struct FlatView FlatView; +typedef struct PhysPageEntry PhysPageEntry; +typedef struct PhysMap PhysMap; +typedef struct MemoryRegionSection MemoryRegionSection; typedef struct MemoryRegionOps MemoryRegionOps; typedef struct MemoryRegionLifeOps MemoryRegionLifeOps; typedef struct MemoryRegion MemoryRegion; typedef struct MemoryRegionPortio MemoryRegionPortio; typedef struct MemoryRegionMmio MemoryRegionMmio; +/* + * Note using signed integers limits us to physical addresses at most + * 63 bits wide. They are needed for negative offsetting in aliases + * (large MemoryRegion::alias_offset). + */ +struct AddrRange { +Int128 start; +Int128 size; +}; + +/* Range of memory in the global map. Addresses are absolute. */ +struct FlatRange { +MemoryRegion *mr; +target_phys_addr_t offset_in_region; +AddrRange addr; +uint8_t dirty_log_mask; +bool readable; +bool readonly; +}; + +/* Flattened global view of current active memory hierarchy. 
Kept in sorted + * order. */ +struct FlatView { +FlatRange *ranges; +unsigned nr; +unsigned nr_allocated; +}; + +struct PhysPageEntry { +uint16_t is_leaf:1; + /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */ +uint16_t ptr:15; +}; + +#define L2_BITS 10 +#define L2_SIZE (1 << L2_BITS) +/* This is a multi-level map on the physical address space. The bottom level has pointers to MemoryRegionSections. */ +struct PhysMap { +Atomic ref; +PhysPageEntry root; +PhysPageEntry (*phys_map_nodes)[L2_SIZE]; +unsigned phys_map_nodes_nb; +unsigned
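The multi-level map consumes `L2_BITS` of the address index per level, so each node has `1 << L2_BITS` entries. A toy sketch of how an index is split into per-level node offsets (illustrative, not the QEMU walker):

```c
/* Each level of the multi-level physical map indexes a node of
 * 1 << TOY_L2_BITS entries; the per-level index is just a bit-field
 * of the page index. */
#define TOY_L2_BITS 10
#define TOY_L2_SIZE (1 << TOY_L2_BITS)

static unsigned toy_level_index(unsigned long long index, int level)
{
    return (index >> (level * TOY_L2_BITS)) & (TOY_L2_SIZE - 1);
}
```

This is why the patch bundles the node array and the root entry into one `PhysMap`: a walk touches several nodes, and they must all belong to the same snapshot.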
Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com The flatview and the radix view are both under the protection of a pointer, and this makes sure that changes to them appear atomic! The mr accessed through a radix-tree leaf or the flatview will be reclaimed once the previous PhysMap is no longer in use. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c | 303 +++--- hw/vhost.c |2 +- hw/xen_pt.c |2 +- kvm-all.c |2 +- memory.c| 92 ++- memory.h|9 ++- vl.c|1 + xen-all.c |2 +- 8 files changed, 286 insertions(+), 127 deletions(-) diff --git a/exec.c b/exec.c index 01b91b0..97addb9 100644 --- a/exec.c +++ b/exec.c @@ -24,6 +24,7 @@ #include sys/mman.h #endif +#include qemu/atomic.h #include qemu-common.h #include cpu.h #include tcg.h @@ -35,6 +36,8 @@ #include qemu-timer.h #include memory.h #include exec-memory.h +#include qemu-thread.h +#include qemu/reclaimer.h #if defined(CONFIG_USER_ONLY) #include qemu.h #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -static MemoryRegionSection *phys_sections; -static unsigned phys_sections_nb, phys_sections_nb_alloc; static uint16_t phys_section_unassigned; static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; - -/* Simple allocator for PhysPageEntry nodes */ -static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; - #define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1) -/* This is a multi-level map on the physical address space. The bottom level has pointers to MemoryRegionSections.
*/ -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; - +static QemuMutex cur_map_lock; +static PhysMap *cur_map; QemuMutex mem_map_lock; +static PhysMap *next_map; static void io_mem_init(void); static void memory_map_init(void); @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index) #if !defined(CONFIG_USER_ONLY) -static void phys_map_node_reserve(unsigned nodes) +static void phys_map_node_reserve(PhysMap *map, unsigned nodes) { -if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) { +if (map->phys_map_nodes_nb + nodes > map->phys_map_nodes_nb_alloc) { typedef PhysPageEntry Node[L2_SIZE]; -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16); -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc, - phys_map_nodes_nb + nodes); -phys_map_nodes = g_renew(Node, phys_map_nodes, - phys_map_nodes_nb_alloc); +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc * 2, +16); +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc, + map->phys_map_nodes_nb + nodes); +map->phys_map_nodes = g_renew(Node, map->phys_map_nodes, + map->phys_map_nodes_nb_alloc); } } -static uint16_t phys_map_node_alloc(void) +static uint16_t phys_map_node_alloc(PhysMap *map) { unsigned i; uint16_t ret; -ret = phys_map_nodes_nb++; +ret = map->phys_map_nodes_nb++; assert(ret != PHYS_MAP_NODE_NIL); -assert(ret != phys_map_nodes_nb_alloc); +assert(ret != map->phys_map_nodes_nb_alloc); for (i = 0; i < L2_SIZE; ++i) { -phys_map_nodes[ret][i].is_leaf = 0; -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; +map->phys_map_nodes[ret][i].is_leaf = 0; +map->phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; } return ret; } -static void phys_map_nodes_reset(void) -{ -phys_map_nodes_nb = 0; -} - - -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, -target_phys_addr_t *nb, uint16_t leaf, +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp, +target_phys_addr_t *index, +target_phys_addr_t *nb, +uint16_t leaf, int
level) { PhysPageEntry *p; @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, target_phys_addr_t step = (target_phys_addr_t)1 (level * L2_BITS); if (!lp-is_leaf lp-ptr == PHYS_MAP_NODE_NIL) { -lp-ptr = phys_map_node_alloc(); -p = phys_map_nodes[lp-ptr]; +lp-ptr = phys_map_node_alloc(map); +p = map-phys_map_nodes[lp-ptr]; if (level == 0) { for (i = 0; i
Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.
On 8 August 2012 20:16, Blue Swirl blauwir...@gmail.com wrote: On Wed, Aug 8, 2012 at 8:17 AM, Cornelia Huck cornelia.h...@de.ibm.com wrote: On Tue, 7 Aug 2012 21:00:59 + Blue Swirl blauwir...@gmail.com wrote: Please use more descriptive names instead of acronyms, for example SubChStatus. I'd rather leave these at the well-known scsw, pmcw, etc. names. These have been around for decades, and somebody familiar with channel I/O will instantly know what a struct scsw is, but will need to look hard at the code to figure out the meaning of SubChStatus. If they are well-known and have been around for so long time, are there any suitable header files (with compatible licenses) where they are defined which could be reused? Otherwise, please follow CODING_STYLE. I think we should follow CODING_STYLE for capitalisation issues but generally if the device's documentation has standard abbreviations for register names, structures, etc, etc we should use them. Often this code has to be maintained later by somebody else who might not be familiar with the general operation of the hardware and who is trying to match up the code with whatever the data sheet says. Following the naming used in the h/w docs makes that job easier. (for instance I took the opportunity of making a bunch of structure member names in target-arm line up with the ARM ARM names as part of the refactoring that went on a while back.) -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/5] s390: Virtual channel subsystem support.
On Wed, Aug 8, 2012 at 7:34 PM, Peter Maydell peter.mayd...@linaro.org wrote: On 8 August 2012 20:16, Blue Swirl blauwir...@gmail.com wrote: On Wed, Aug 8, 2012 at 8:17 AM, Cornelia Huck cornelia.h...@de.ibm.com wrote: On Tue, 7 Aug 2012 21:00:59 + Blue Swirl blauwir...@gmail.com wrote: Please use more descriptive names instead of acronyms, for example SubChStatus. I'd rather leave these at the well-known scsw, pmcw, etc. names. These have been around for decades, and somebody familiar with channel I/O will instantly know what a struct scsw is, but will need to look hard at the code to figure out the meaning of SubChStatus. If they are well-known and have been around for so long time, are there any suitable header files (with compatible licenses) where they are defined which could be reused? Otherwise, please follow CODING_STYLE. I think we should follow CODING_STYLE for capitalisation issues but generally if the device's documentation has standard abbreviations for register names, structures, etc, etc we should use them. Often this code has to be maintained later by somebody else who might not be familiar with the general operation of the hardware and who is trying to match up the code with whatever the data sheet says. Following the naming used in the h/w docs makes that job easier. Yes. typedef struct SCSW {} SCSW; should be OK too. (for instance I took the opportunity of making a bunch of structure member names in target-arm line up with the ARM ARM names as part of the refactoring that went on a while back.) -- PMM
Re: UIO: missing resource mapping
On Wed, Jul 18, 2012 at 12:40:47PM +0200, Dominic Eschweiler wrote: Am Montag, den 16.07.2012, 23:58 +0200 schrieb Hans J. Koch: Try to hack up a patch to add generic BAR mapping to uio_pci_generic.c and post it for review. Here we go ... Thank you very much for your work. I'm really sorry for the long delay, but I was busy finishing a project because I go on vacation tomorrow. Sorry, that might cause further delay since I don't know yet how often I can read my mail... Greg, can you review the next one? Here's a first review. Thanks, Hans Signed-off-by: Dominic Eschweiler eschwei...@fias.uni-frankfurt.de diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c index 0bd08ef..e25991e 100644 --- a/drivers/uio/uio_pci_generic.c +++ b/drivers/uio/uio_pci_generic.c @@ -25,10 +25,12 @@ #include <linux/slab.h> #include <linux/uio_driver.h> -#define DRIVER_VERSION "0.01.0" +#define DRIVER_VERSION "0.02.0" #define DRIVER_AUTHOR "Michael S. Tsirkin m...@redhat.com" #define DRIVER_DESC "Generic UIO driver for PCI 2.3 devices" +#define DRV_NAME "uio_pci_generic" + struct uio_pci_generic_dev { struct uio_info info; struct pci_dev *pdev; @@ -58,6 +60,7 @@ static int __devinit probe(struct pci_dev *pdev, { struct uio_pci_generic_dev *gdev; int err; + int i; err = pci_enable_device(pdev); if (err) { @@ -67,8 +70,7 @@ static int __devinit probe(struct pci_dev *pdev, } if (!pdev->irq) { - dev_warn(&pdev->dev, "No IRQ assigned to device: " - "no support for interrupts?\n"); + dev_warn(&pdev->dev, "No IRQ assigned to device: no support for interrupts?\n"); Please configure your mail client not to break lines when sending a patch. It can't be applied like this. Why did you make that change anyway? If it's just coding style, please send another patch, don't mix functional changes with coding style fixes.
pci_disable_device(pdev); return -ENODEV; } @@ -91,10 +93,31 @@ static int __devinit probe(struct pci_dev *pdev, gdev->info.handler = irqhandler; gdev->pdev = pdev; + /* request regions */ + err = pci_request_regions(pdev, DRV_NAME); + if (err) { + dev_err(&pdev->dev, "Couldn't get PCI resources, aborting\n"); + return err; + } + + /* create attributes for BAR mappings */ + for (i = 0; i < PCI_NUM_RESOURCES; i++) { + if (pdev->resource[i].flags && + (pdev->resource[i].flags & IORESOURCE_MEM)) { + gdev->info.mem[i].addr = pci_resource_start(pdev, i); + gdev->info.mem[i].size = pci_resource_len(pdev, i); + gdev->info.mem[i].internal_addr = NULL; + gdev->info.mem[i].memtype = UIO_MEM_PHYS; + } + } + if (uio_register_device(&pdev->dev, &gdev->info)) goto err_register; pci_set_drvdata(pdev, gdev); + pr_info("UIO_PCI_GENERIC : initialized new device (%x %x)\n", Please use dev_info() + pdev->vendor, pdev->device); + return 0; err_register: kfree(gdev); @@ -107,17 +130,21 @@ err_verify: static void remove(struct pci_dev *pdev) { struct uio_pci_generic_dev *gdev = pci_get_drvdata(pdev); - uio_unregister_device(&gdev->info); + + pci_release_regions(pdev); pci_disable_device(pdev); kfree(gdev); + + pr_info("UIO_PCI_GENERIC : removed device (%x %x)\n", ditto + pdev->vendor, pdev->device); } static struct pci_driver driver = { - .name = "uio_pci_generic", + .name = DRV_NAME, .id_table = NULL, /* only dynamic id's */ - .probe = probe, - .remove = remove, + .probe = probe, + .remove = remove, As above: Please put coding style fixes in an extra patch (if you really insist on tabs instead of spaces...) }; static int __init init(void) -- Gruß Dominic Frankfurt Institute for Advanced Studies (FIAS) Ruth-Moufang-Straße 1 D-60438 Frankfurt am Main Germany Phone: +49 69 79844114
[PATCH] kvm tools: Add initial virtio-scsi support
This patch brings virtio-scsi support to kvm tool. With the introduction of tcm_vhost (vhost-scsi) in "tcm_vhost: Initial merge for vhost level target fabric driver", we can implement virtio-scsi by simply having vhost-scsi handle the SCSI commands. Howto use: 1) Setup the tcm_vhost target through /sys/kernel/config [Stefan Hajnoczi, Thanks for the script to setup tcm_vhost] ** Setup wwpn and tpgt $ wwpn=naa.0 $ tpgt=/sys/kernel/config/target/vhost/$wwpn/tpgt_0 $ nexus=$tpgt/nexus $ mkdir -p $tpgt $ echo -n $wwpn > $nexus ** Setup lun using /dev/ram $ n=0 $ lun=$tpgt/lun/lun_${n} $ data=/sys/kernel/config/target/core/iblock_0/data_${n} $ ram=/dev/ram${n} $ mkdir -p $lun $ mkdir -p $data $ echo -n udev_path=${ram} > $data/control $ echo -n 1 > $data/enable $ ln -s $data $lun 2) Run kvm tool with the new disk option '-d scsi:$wwpn:$tpgt', e.g. $ lkvm run -k /boot/bzImage -d ~/img/sid.img -d scsi:naa.0:0 Signed-off-by: Asias He asias.he...@gmail.com Cc: Nicholas A. Bellinger n...@linux-iscsi.org Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com Cc: Paolo Bonzini pbonz...@redhat.com --- tools/kvm/Makefile | 1 + tools/kvm/builtin-run.c| 25 +++ tools/kvm/disk/core.c | 14 ++ tools/kvm/include/kvm/disk-image.h | 4 + tools/kvm/include/kvm/virtio-pci-dev.h | 1 + tools/kvm/include/kvm/virtio-scsi.h| 11 ++ tools/kvm/include/linux/compiler.h | 1 + tools/kvm/virtio/blk.c | 2 + tools/kvm/virtio/scsi.c| 332 + 9 files changed, 391 insertions(+) create mode 100644 tools/kvm/include/kvm/virtio-scsi.h create mode 100644 tools/kvm/virtio/scsi.c diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile index f9e1ec1..749956a 100644 --- a/tools/kvm/Makefile +++ b/tools/kvm/Makefile @@ -55,6 +55,7 @@ OBJS += mmio.o OBJS += pci.o OBJS += term.o OBJS += virtio/blk.o +OBJS += virtio/scsi.o OBJS += virtio/console.o OBJS += virtio/core.o OBJS += virtio/net.o diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c index a36bd00..c3ec469 100644 --- a/tools/kvm/builtin-run.c +++ b/tools/kvm/builtin-run.c @@
-8,6 +8,7 @@ #include kvm/framebuffer.h #include kvm/disk-image.h #include kvm/threadpool.h +#include kvm/virtio-scsi.h #include kvm/virtio-blk.h #include kvm/virtio-net.h #include kvm/virtio-rng.h @@ -171,6 +172,19 @@ static int img_name_parser(const struct option *opt, const char *arg, int unset) disk_image[image_count].filename = arg; cur = arg; + + if (strncmp(arg, scsi:, 5) == 0) { + sep = strstr(arg, :); + if (sep) + disk_image[image_count].wwpn = sep + 1; + sep = strstr(sep + 1, :); + if (sep) { + *sep = 0; + disk_image[image_count].tpgt = sep + 1; + } + cur = sep + 1; + } + do { sep = strstr(cur, ,); if (sep) { @@ -1184,6 +1198,13 @@ static int kvm_cmd_run_init(int argc, const char **argv) goto fail; } + r = virtio_scsi_init(kvm); + if (r 0) { + pr_err(virtio_scsi_init() failed with error %d\n, r); + goto fail; + } + + if (active_console == CONSOLE_VIRTIO) virtio_console__init(kvm); @@ -1332,6 +1353,10 @@ static void kvm_cmd_run_exit(int guest_ret) fb__stop(); + r = virtio_scsi_exit(kvm); + if (r 0) + pr_warning(virtio_scsi_exit() failed with error %d\n, r); + r = virtio_blk__exit(kvm); if (r 0) pr_warning(virtio_blk__exit() failed with error %d\n, r); diff --git a/tools/kvm/disk/core.c b/tools/kvm/disk/core.c index 621c940..f7e2c7f 100644 --- a/tools/kvm/disk/core.c +++ b/tools/kvm/disk/core.c @@ -122,6 +122,8 @@ struct disk_image **disk_image__open_all(struct disk_image_params *params, int c { struct disk_image **disks; const char *filename; + const char *wwpn; + const char *tpgt; bool readonly; bool direct; void *err; @@ -140,6 +142,18 @@ struct disk_image **disk_image__open_all(struct disk_image_params *params, int c filename = params[i].filename; readonly = params[i].readonly; direct = params[i].direct; + wwpn = params[i].wwpn; + tpgt = params[i].tpgt; + + if (wwpn) { + disks[i] = malloc(sizeof(struct disk_image)); + if (!disks[i]) + return ERR_PTR(-ENOMEM); + disks[i]-wwpn = wwpn; + disks[i]-tpgt = tpgt; + continue; + } + if (!filename) continue; diff 
--git
Re: [PATCH 6/8] KVM: PPC: E500: Implement MMU notifiers
On Tue, Aug 07, 2012 at 12:57:14PM +0200, Alexander Graf wrote: The e500 target has lived without mmu notifiers ever since it got introduced, but fails the user space check on them with hugetlbfs. Ironically that user space check isn't necessary any more since David Gibson's fix for the hugetlbfs bug went in (90481622, hugepages: fix use after free bug in quota handling). So on sufficiently recent kernels you can just remove the userspace check. Implementing mmu notifiers is a good thing for other reasons though. Paul.
Re: [PATCH 8/8] KVM: PPC: Add cache flush on page map
On 07.08.2012, at 23:01, Scott Wood scottw...@freescale.com wrote: On 08/07/2012 05:57 AM, Alexander Graf wrote: When we map a page that wasn't icache cleared before, do so when first mapping it in KVM using the same information bits as the Linux mapping logic. That way we are 100% sure that any page we map does not have stale entries in the icache. What we really would need is a method to flush the icache only when we actually map a page executable for the guest. But with the current abstraction that is hard to do, and this way we don't have a huge performance penalty, so better be safe now and micro optimize things later. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/mm/mem.c |1 + 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index ed75bc9..c0a2fc1 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -33,6 +33,7 @@ #include <asm/kvm_asm.h> #include <asm/processor.h> #include <asm/page.h> +#include <asm/cacheflush.h> #define KVM_MAX_VCPUS NR_CPUS #define KVM_MAX_VCORES NR_CPUS @@ -550,5 +551,14 @@ struct kvm_vcpu_arch { #define KVM_MMIO_REG_FQPR 0x0060 #define __KVM_HAVE_ARCH_WQP +#define __KVM_HAVE_ARCH_MAP_PAGE +static inline void kvm_arch_map_page(struct page *page) +{ +/* Need to invalidate the icache for new pages */ +if (!test_bit(PG_arch_1, &page->flags)) { +flush_dcache_icache_page(page); +set_bit(PG_arch_1, &page->flags); +} +} Shouldn't this test CPU_FTR_COHERENT_ICACHE? flush_icache_range checks for it a bit further down the code stream. I mostly modeled things the same way as set_pte_filter. Alex
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote: On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote: Peter has started to fix up this naming mess in qemu. I guess we should do the same for the kernel (except for ABIs) and document it, because it keeps generating confusion. Ok so our current situation is that the XICS and MPIC are emulated in userspace entirely but the link between them and the VCPU is the asynchronous EE line so we are fine. I'm currently working on moving the XICS into the kernel for performance reasons, however, just like ARM VGIC, I can't seem to find a way to fit the generic irqchip code in there. It's just not generic at all and quite x86 centric :-) The generic code is for the two apic architectures: x86 and ia64. Don't try to shoehorn it in, you'll damage both the shoe and your foot. So for now I'm just doing my own version of CREATE_IRQCHIP to create it and KVM_INTERRUPT to trigger the various interrupts. None of the mapping stuff (which we really don't need). You mean KVM_IRQ_LINE. KVM_INTERRUPT is a synchronous vcpu ioctl. That's a bit of a problem vs. some of the code qemu-side such as in virtio-pci which does seem to be written around the model exposed by the x86 stuff and relies on doing such mappings so I think we'll have to butcher some of that. Can you elaborate? virtio-pci is pci-centric, there should be nothing x86 specific there. -- error compiling committee.c: too many arguments to function
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On 08/08/2012 03:49 AM, David Gibson wrote: We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 doesn't do it in what seems like the analogous situation. It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case. Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86). Uh.. overloaded in what way. On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and the two PICs are emulated in the kernel. Now the IOAPIC and the PICs correspond to non-x86 interrupt controllers, but the local APIC is more tightly coupled to the core. Interrupt acceptance by the core is an operation that involves synchronous communication with the local APIC: the APIC presents the interrupt, the core accepts it based on the value of the interrupt enable flag and possibly a register (CR8), then the APIC updates the ISR and IRR. The upshot is that if the local APIC is in userspace, interrupts must be synchronous with vcpu execution, so that KVM_INTERRUPT is a vcpu ioctl and HLT is emulated in userspace (so that local APIC emulation can check if an interrupt wakes it up or not). Sorry, still not 100% getting it. When the vcpu is actually running code, that synchronous communication must still be accomplished via the KVM_INTERRUPT ioctl, yes? So what makes HLT different, that the communication can't be accomplished in that case. No, you're correct. HLT could have been emulated in userspace, it just wasn't.
The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel. -- error compiling committee.c: too many arguments to function
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote: So for now I'm just doing my own version of CREATE_IRQCHIP to create it and KVM_INTERRUPT to trigger the various interrupts. None of the mapping stuff (which we really don't need). You mean KVM_IRQ_LINE. KVM_INTERRUPT is a synchronous vcpu ioctl. Yes, sorry, brain and fingers not agreeing :-) That's a bit of a problem vs. some of the code qemu-side such as in virtio-pci which does seem to be written around the model exposed by the x86 stuff and relies on doing such mappings so I think we'll have to butcher some of that. Can you elaborate? virtio-pci is pci-centric, there should be nothing x86 specific there. The kvm_irqchip_add_msi_route & co is a problem, it more/less dives into the x86/ia64 specific routing stuff at a generic level, I'll have to hook things up differently in qemu I suspect. Not a huge deal, we can sort it out. Cheers, Ben.
Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation
On 07.08.2012, at 17:59, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, August 07, 2012 3:38 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Bhushan Bharat-R65777 Subject: Re: [PATCH 1/2] KVM: PPC: booke: Add watchdog emulation On 03.08.2012, at 07:30, Bharat Bhushan wrote: This patch adds the watchdog emulation in KVM. The watchdog emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl. A kernel timer is used for the watchdog emulation, emulating the h/w watchdog state machine. On watchdog timer expiry, it exits to QEMU if TCR.WRC is non-zero. QEMU can reset/shutdown etc. depending upon how it is configured. Signed-off-by: Liu Yu yu@freescale.com Signed-off-by: Scott Wood scottw...@freescale.com [bharat.bhus...@freescale.com: reworked patch] Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v7: - kvmppc_core_dequeue_watchdog() and kvmppc_core_enque_watchdog() are made static - Clear the KVM_REQ_WATCHDOG request when TSR_ENW | ENW_WIS cleared rather than checking these bits on handling the request.
Below changes (pasted for quick understanding) diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 7eedaeb..a0f5922 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -629,8 +629,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) goto out; } - if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) - (vcpu-arch.tsr (TSR_ENW | TSR_WIS))) { + if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) { kvm_run-exit_reason = KVM_EXIT_WATCHDOG; ret = 0; goto out; @@ -1098,8 +1097,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, } } - if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) - (vcpu-arch.tsr (TSR_ENW | TSR_WIS))) { + if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)) { run-exit_reason = KVM_EXIT_WATCHDOG; r = RESUME_HOST | (r RESUME_FLAG_NV); } @@ -1251,8 +1249,10 @@ static int set_sregs_base(struct kvm_vcpu *vcpu, vcpu-arch.tsr = sregs-u.e.tsr; - if ((old_tsr ^ vcpu-arch.tsr) (TSR_ENW | TSR_WIS)) + if ((old_tsr ^ vcpu-arch.tsr) (TSR_ENW | TSR_WIS)) { + clear_bit(KVM_REQ_WATCHDOG, vcpu-requests); arm_next_watchdog(vcpu); + } update_timer_ints(vcpu); } @@ -1434,8 +1434,10 @@ void kvmppc_clr_tsr_bits(struct kvm_vcpu *vcpu, u32 tsr_bits) * We may have stopped the watchdog due to * being stuck on final expiration. */ - if (tsr_bits (TSR_ENW | TSR_WIS)) + if (tsr_bits (TSR_ENW | TSR_WIS)) { + clear_bit(KVM_REQ_WATCHDOG, vcpu-requests); arm_next_watchdog(vcpu); + } -- v6: - Added kvmppc_subarch_vcpu_unit/uninit() for subarch specific initialization - Using spin_lock_irqsave() - Using del_timer_sync(). v5: - Checking that TSR_ENW/TSR_WIS are still set on KVM_EXIT_WATCHDOG. - Moved watchdog_next_timeout() in lock. 
arch/powerpc/include/asm/kvm_host.h |3 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/asm/reg_booke.h |7 ++ arch/powerpc/kvm/book3s.c|9 ++ arch/powerpc/kvm/booke.c | 158 ++ arch/powerpc/kvm/booke_emulate.c |8 ++ arch/powerpc/kvm/powerpc.c | 14 +++- include/linux/kvm.h |2 + include/linux/kvm_host.h |1 + 9 files changed, 202 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 572ad01..873ec11 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -469,6 +469,8 @@ struct kvm_vcpu_arch { ulong fault_esr; ulong queued_dear; ulong queued_esr; + spinlock_t wdt_lock; + struct timer_list wdt_timer; u32 tlbcfg[4]; u32 mmucfg; u32 epr; @@ -484,6 +486,7 @@ struct kvm_vcpu_arch { u8 osi_needed; u8 osi_enabled; u8 papr_enabled; + u8 watchdog_enabled; u8 sane; u8 cpu_type; u8 hcall_needed; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 0124937..1438a5e 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -68,6 +68,8 @@ extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu); extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb); extern void kvmppc_decrementer_func(unsigned long data); extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu); +extern int kvmppc_subarch_vcpu_init(struct kvm_vcpu *vcpu); +extern void
Re: [PATCH 3/3] KVM: PPC: booke: Added debug handler
On 08.08.2012, at 03:02, Bhushan Bharat-R65777 wrote: -Original Message- From: Wood Scott-B07421 Sent: Wednesday, August 08, 2012 2:15 AM To: Alexander Graf Cc: Bhushan Bharat-R65777; kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Bhushan Bharat-R65777 Subject: Re: [PATCH 3/3] KVM: PPC: booke: Added debug handler On 08/07/2012 05:47 AM, Alexander Graf wrote: diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S index 3539805..890673c 100644 --- a/arch/powerpc/kvm/booke_interrupts.S +++ b/arch/powerpc/kvm/booke_interrupts.S @@ -73,6 +73,51 @@ _GLOBAL(kvmppc_handler_\ivor_nr) bctr .endm +.macro KVM_DBG_HANDLER ivor_nr scratch srr0 This is a lot of asm code. Any chance to share a good share of it with the generic handler? That entire file could use an update to look more like bookehv_interrupts.S and its use of asm macros. In booke there is an assumption that the size of the KVM IVORs will not be more than the host IVORs size, so that only IVPR is changed. I tried to give it the shape of bookehv_interrupts.S and found that the size of some IVORs became more than the host IVORs. We can always jump off to another (more generic?) function and only have a small stub in the IVOR referenced code. Alex
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote: On 08/08/2012 03:49 AM, David Gibson wrote: We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 don't do it in what seems like the analogous situation. It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case. Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86). Uh.. overloaded in what way. On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and the two PICs are emulated in the kernel. Now the IOAPIC and the PICs correspond to non-x86 interrupt controllers, but the local APIC is more tightly coupled to the core. Interrupt acceptance by the core is an operation that involved synchronous communication with the local APIC: the APIC presents the interrupt, the core accepts it based on the value of the interrupt enable flag and possible a register (CR8), then the APIC updates the ISR and IRR. The upshot is that if the local APIC is in userspace, interrupts must be synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl and HLT is emulated in userspace (so that local APIC emulation can check if an interrupt wakes it up or not). Sorry, still not 100% getting it. When the vcpu is actually running code, that synchronous communication must still be accomplished via the KVM_INTERRUPT ioctl, yes? So what makes HLT different, that the communication can't be accomplished in that case. No, you're correct. HLT could have been emulated in userspace, it just wasn't. 
The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel. Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace in the first line there, anyway. Ok, so I am now reassured that our current handling of CEDE in kernelspace does not cause problems. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On 08/08/2012 02:59 PM, David Gibson wrote: No, you're correct. HLT could have been emulated in userspace, it just wasn't. The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel. Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace in the first line there, anyway. I did. Ok, so I am now reassured that our current handling of CEDE in kernelspace does not cause problems. Great. It's a real pity the original local APIC implementation was in userspace, it causes no end of confusion, and in fact was broken for smp until recently even though it is 7 years old. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/8] KVM: PPC: E500: Implement MMU Notifiers
On 07.08.2012, at 12:57, Alexander Graf wrote: This patch set adds a very simple implementation of MMU notifiers for the e500 target. Along the way, I stumbled over a few other things that I've put into the same patch set, namely: * e500 prerequisites * icache flushing on page map (probably also hits ARM!) * exit trace point for e500 Alexander Graf (8): KVM: PPC: BookE: Expose remote TLB flushes in debugfs KVM: PPC: E500: Fix clear_tlb_refs KVM: PPC: PR: Use generic tracepoint for guest exit KVM: PPC: Expose SYNC cap based on mmu notifiers I applied the above 4 patches to kvm-ppc-next. Alex